Spark-PDF - Custom Data Source for Reading PDFs

An Open-Source Data Source for dealing with PDF files in Apache Spark

This project provides a custom data source for Apache Spark, enabling you to read PDF files directly into Spark DataFrames. It’s designed to simplify the process of working with PDFs in distributed data pipelines, whether you're dealing with text-based documents, scanned PDFs, or large files with thousands of pages.

🚀 Databricks Integration

This project now works on Databricks. Check out the Databricks example for more details.

Key Features

  • Read PDFs into DataFrames: Directly load PDF files into Spark DataFrames.
  • Lazy Loading: Process PDFs page by page to optimize memory usage and handle large files efficiently.
  • Scala and Python Support: Use the library with both Scala and PySpark APIs.
  • Built-in OCR: Extract text from scanned PDFs using integrated OCR—no need to install or configure Tesseract separately.
  • Large File Support: Handle PDFs with up to 10,000 pages without performance bottlenecks.
  • Spark Connect Compatibility: Works seamlessly with Spark Connect for distributed processing.
👉 ScaleDP Compatibility

Compatible with ScaleDP, an Open-Source Library for Processing Documents using AI/ML in Apache Spark.

How It Works

The library extends Apache Spark’s Data Source API, allowing you to treat PDFs as a native data source. For text-based PDFs, it extracts content directly. For scanned PDFs, the built-in OCR engine processes the images and extracts text.

The lazy loading feature ensures that only the required pages are loaded into memory, making it efficient for large files.
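
To make this concrete, here is a minimal sketch, assuming an existing SparkSession named spark (created as in the usage examples below) and a hypothetical 100-page file at /tmp/large.pdf. With pagePerPartition set to 5, the file is split into roughly 20 partitions, and an action that needs only the first rows typically evaluates only the first partitions:

Scala
// Minimal sketch: the file path and page count are hypothetical.
val largeDf = spark.read.format("pdf")
  .option("pagePerPartition", "5") // ~100 pages -> ~20 partitions
  .load("/tmp/large.pdf")

// show() pulls only the first 20 rows, so Spark typically evaluates
// only the partitions needed to produce them, not the whole file.
largeDf.select("page_number", "text").show()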

Getting Started

Requirements:

  • Java 8, 11, or 17
  • Apache Spark 3.3.2, 3.4.1, 3.5.0, or 4.0.0
  • Ghostscript 9.50 or later (only for the Ghostscript reader)
👉 Support for Spark 4.0.0

Spark 4.0.0 is supported in version 0.1.15 and later (requires Java 17 and Scala 2.13).

Installation

The binary package is available in the Maven Central Repository. To install the package for your version of Apache Spark, use the following Maven coordinates:

  • For Spark 3.5: com.stabrise:spark-pdf-spark35_2.12:0.1.15
  • For Spark 3.4: com.stabrise:spark-pdf-spark34_2.12:0.1.11
  • For Spark 3.3: com.stabrise:spark-pdf-spark33_2.12:0.1.15
  • For Spark 4.0: com.stabrise:spark-pdf-spark40_2.13:0.1.15

Simply add the corresponding dependency to your project’s pom.xml or build configuration.
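
For example, with sbt the Spark 3.5 artifact from the list above can be added like this (adjust the artifact name and version to match your Spark version):

Scala
// build.sbt: the full artifact name already encodes the Spark and
// Scala versions, so the plain % operator is used instead of %%.
libraryDependencies += "com.stabrise" % "spark-pdf-spark35_2.12" % "0.1.15"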

Configuration Options

  • imageType: Output image type. Can be "BINARY", "GREY", or "RGB". Default: "RGB".
  • resolution: Resolution (dpi) for rendering a PDF page to an image. Default: "300".
  • pagePerPartition: Number of pages per partition in the Spark DataFrame. Default: "5".
  • reader: PDF reader to use: pdfBox (based on the Apache PDFBox Java library) or gs (based on Ghostscript; requires Ghostscript to be installed on the system).

DataFrame Output Columns

The DataFrame contains the following columns:

  • path: The path to the PDF file.
  • page_number: The page number within the document.
  • text: The extracted text from the text layer of the PDF page.
  • image: The image representation of the page.
  • document: The OCR-extracted text from the rendered page image (produced by Tesseract OCR).
  • partition_number: The partition number.
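
As a quick way to inspect these columns, you can compare the embedded text layer with the OCR output page by page (a minimal sketch, assuming the df DataFrame from the usage examples below):

Scala
// Compare the PDF text layer ("text") with the OCR result ("document").
df.select("page_number", "text", "document")
  .orderBy("page_number")
  .show(truncate = false)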

Usage Examples

Scala Example

The following Scala code demonstrates how to read a PDF file into a Spark DataFrame. It sets various options such as image type, resolution, pages per partition, and the reader to use (PDFBox in this case):

Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.15")
  .getOrCreate()

val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .load("path to the pdf file(s)")

df.select("path", "document").show()
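
From here the DataFrame behaves like any other: for example, you might persist the extracted text for downstream processing (the output path below is illustrative):

Scala
// Write the per-page text layer to Parquet (illustrative path).
df.select("path", "page_number", "text")
  .write
  .mode("overwrite")
  .parquet("/tmp/pdf-text")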

Python Example

The Python example is similar: we use PySpark to load a PDF into a DataFrame, configuring the same options for the pdf data source:

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.15") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()