Introducing Spark PDF: A Powerful Data Source for Apache Spark

This blog post introduces Spark PDF, a custom data source for Apache Spark that lets you read PDF files directly into your Spark workflows.

Source Code: https://github.com/StabRise/spark-pdf

Quick Start Jupyter Notebook Spark 3.x.x: PdfDataSource.ipynb

Quick Start Jupyter Notebook Spark 4.0.x: PdfDataSourceSpark4.ipynb

Key Capabilities:

Effortless PDF Ingestion: Read PDF documents directly into Spark DataFrames for efficient data processing and analysis.
Optimized Performance: Pages are read lazily, one at a time, which keeps memory consumption low even for large-scale datasets.
Robust Support for Diverse PDF Formats: Handle a wide range of PDF files, including those with extensive page counts (up to 10,000 pages).
Enhanced OCR Functionality: Extract text from scanned PDF documents with built-in Optical Character Recognition (OCR) capabilities.
Simplified Setup: No separate Tesseract OCR installation is needed, because Tesseract OCR is included within the package.
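To make the capabilities above concrete, here is a minimal PySpark sketch of reading PDFs into a DataFrame. It assumes the package is already on the session's classpath and that the data source registers under the short name `pdf`. The option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) and their values are assumptions that should be checked against the project repository. The helper names `pdf_read_options` and `read_pdf` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict


def pdf_read_options(
    resolution: int = 200,
    pages_per_partition: int = 2,
    reader: str = "pdfBox",
) -> Dict[str, str]:
    """Build reader options as strings.

    The option names and values are assumptions, not confirmed by this post.
    """
    return {
        "imageType": "BINARY",
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
        "reader": reader,
    }


def read_pdf(spark: Any, path: str, **overrides: Any):
    """Read one or more PDF files into a DataFrame.

    `spark` is an existing SparkSession; the format name "pdf" is assumed.
    """
    options = pdf_read_options(**overrides)
    return spark.read.format("pdf").options(**options).load(path)
```

With a live session, `read_pdf(spark, "docs/*.pdf").printSchema()` would show which columns the data source actually produces, which is a safer first step than guessing column names.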

System Requirements:

Java: Java 8, 11, or 17
Apache Spark: 3.3.2, 3.4.1, 3.5.0, or 4.0.0 (Spark 4.0.0 is supported from version 0.1.11 onward and requires Java 17 and Scala 2.13)
Ghostscript: 9.50 or later (required only for the Ghostscript reader)
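If you plan to use the Ghostscript reader, a quick pre-flight check can confirm that the installed version meets the 9.50 minimum listed above. The sketch below assumes the executable is named `gs` and is on the PATH (on Windows it is usually named differently). The function names are hypothetical and not part of Spark PDF.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_GHOSTSCRIPT = (9, 50)  # minimum version stated in the requirements


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '10.02.1' into a comparable tuple (10, 2, 1)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def ghostscript_version(executable: str = "gs") -> Optional[Tuple[int, ...]]:
    """Return the installed Ghostscript version, or None if it is not on PATH."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout)


def ghostscript_ok(version: Optional[Tuple[int, ...]]) -> bool:
    """True when a version was found and it is at least MIN_GHOSTSCRIPT."""
    return version is not None and version >= MIN_GHOSTSCRIPT
```

Calling `ghostscript_ok(ghostscript_version())` before starting a job gives an early, readable failure instead of an error from inside a Spark task.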

Installation:

The binary package is available from the Maven Central Repository. Use the Maven coordinates that match your Spark version:

Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
Spark 4.0.*: com.stabrise:spark-pdf-spark40_2.13:0.1.11
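One common way to pull a Maven package into a PySpark session is the standard `spark.jars.packages` setting, which has Spark resolve the artifact at startup. The sketch below assembles a coordinate in the pattern listed above and passes it to the session builder. The helper names `spark_pdf_coordinate` and `build_session`, and the application name, are hypothetical. The defaults target Spark 3.5 with Scala 2.12.

```python
def spark_pdf_coordinate(
    spark_tag: str = "spark35",
    scala: str = "2.12",
    version: str = "0.1.11",
) -> str:
    """Assemble a Maven coordinate following the pattern in the list above."""
    return f"com.stabrise:spark-pdf-{spark_tag}_{scala}:{version}"


def build_session(coordinate: str, app_name: str = "SparkPdfDemo"):
    """Create a SparkSession that downloads the given package at startup."""
    # Imported lazily so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )
```

For example, `build_session(spark_pdf_coordinate("spark34"))` would request the Spark 3.4 build. The setting only takes effect when a new session is created, so it has to be applied before any other SparkSession exists in the process.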

Contributing to the Project:

We encourage community involvement. If you find Spark PDF valuable, please consider showing your support by starring the project repository on GitHub.

Conclusion:

Spark PDF significantly enhances the capabilities of Apache Spark by providing a robust and efficient mechanism for integrating PDF data into your data pipelines. This empowers data scientists, engineers, and analysts to unlock valuable insights from previously inaccessible PDF sources.
