This project provides a custom Apache Spark data source that lets you read PDF files into a Spark DataFrame.
If you find this project useful, please give the repository a star.
Key features:
❎ Read PDF documents into a Spark DataFrame
❎ Works with Scala and Python (PySpark)
❎ Lazy, per-page reading of PDF files
❎ Supports large files, up to 10k pages
❎ Supports scanned PDF files (runs OCR)
❎ No need to install Tesseract OCR, it's included in the package
Requirements
- Java 8, 11, 17
- Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0
- Ghostscript 9.50 or later (only for the Ghostscript reader)
Spark 4.0.0 is supported in version 0.1.11 and later (requires Java 17 and Scala 2.13).
Installation
Binary packages are available in the Maven Central Repository:
- Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
- Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
- Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
- Spark 4.0.*: com.stabrise:spark-pdf-spark40_2.13:0.1.11
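As a minimal sketch of wiring the dependency into a build, the Spark 3.5 artifact listed above can be added in sbt (adjust the artifact and version to your Spark and Scala setup); alternatively, pass the same coordinate via spark.jars.packages, as in the examples below.

// build.sbt (sketch): dependency for Spark 3.5.x / Scala 2.12, using the coordinate listed above
libraryDependencies += "com.stabrise" % "spark-pdf-spark35_2.12" % "0.1.11"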
Options for the data source:
- imageType: Output image type. One of "BINARY", "GREY", or "RGB". Default: "RGB".
- resolution: Resolution (dpi) for rendering a PDF page to an image. Default: "300".
- pagePerPartition: Number of pages per partition in the Spark DataFrame. Default: "5".
- reader: PDF reader to use. Supported values: pdfBox (based on the PDFBox Java library) and gs (based on Ghostscript; requires Ghostscript installed on the system).
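For illustration, these options can be combined for the Ghostscript reader as in the sketch below. It assumes an active SparkSession named spark (as in the examples later in this document), Ghostscript installed on the system, and a hypothetical input path.

// Sketch: read scanned PDFs in greyscale at 150 dpi using the Ghostscript reader.
// Requires Ghostscript on the system; the input path is hypothetical.
val scannedDf = spark.read.format("pdf")
  .option("imageType", "GREY")
  .option("resolution", "150")
  .option("pagePerPartition", "4")
  .option("reader", "gs")
  .load("/data/scanned/*.pdf")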
Output Columns in the DataFrame:
The DataFrame contains the following columns:
- path: path to the file
- page_number: page number within the document
- text: text extracted from the text layer of the PDF page
- image: image representation of the page
- document: OCR-extracted text from the rendered image (calls Tesseract OCR)
- partition_number: partition number
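For example, the schema and the per-page columns can be inspected directly; this is a short sketch assuming a DataFrame df loaded as in the examples below.

// Sketch: print the schema and compare the text layer with the OCR output per page.
df.printSchema()
df.select("path", "page_number", "text", "document").show(false)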
Example of usage
Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7")
  .getOrCreate()

val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .load("path to the pdf file(s)")

df.select("path", "document").show()
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()
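As a follow-up sketch in Scala (assuming the df from the Scala example above and a hypothetical output path), the per-page OCR text can be persisted like any other DataFrame:

// Sketch: persist the per-page OCR output to Parquet.
// Assumes df from the Scala example above; the output path is hypothetical.
df.select("path", "page_number", "document")
  .write
  .mode("overwrite")
  .parquet("/tmp/pdf-ocr-output")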