Introducing Spark PDF: A Powerful Data Source for Apache Spark
This post introduces Spark PDF, a custom data source for Apache Spark that reads PDF files directly into Spark workflows.
Key Capabilities:
- Effortless PDF Ingestion: Read PDF documents directly into Spark DataFrames for efficient data processing and analysis.
- Optimized Performance: Pages are read lazily, one at a time, which keeps memory consumption low even on large datasets.
- Large-Document Support: Handle PDF files with very high page counts (up to 10,000 pages).
- Built-in OCR: Extract text from scanned PDF documents using Optical Character Recognition (OCR).
- Simplified Setup: No separate Tesseract OCR installation is required; Tesseract is bundled with the package.
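To make the ingestion step concrete, here is a minimal Python sketch of loading PDFs through Spark's generic `DataFrameReader` chain (`format`, `options`, `load`). The format name `pdf` and every option name and value below are assumptions made for illustration, not confirmed by this post; check the project documentation for the names your version actually accepts. The helper `read_pdfs` is hypothetical as well.

```python
# Assumed option names and values, for illustration only.
# Verify each one against the Spark PDF documentation before relying on it.
DEFAULT_PDF_OPTIONS = {
    "reader": "pdfBox",        # assumed reader name; a Ghostscript reader also exists
    "resolution": "200",       # assumed: rendering resolution used for OCR
    "pagePerPartition": "2",   # assumed: number of pages grouped per Spark partition
}


def read_pdfs(spark, path, **overrides):
    """Load the PDF files under `path` using an existing SparkSession.

    `spark` is any object exposing the standard `read.format().options().load()`
    chain; keyword overrides replace the defaults above.
    """
    options = {**DEFAULT_PDF_OPTIONS, **overrides}
    # "pdf" is the assumed short name of the data source.
    return spark.read.format("pdf").options(**options).load(path)
```

With a SparkSession that has the package on its classpath, a call would look like `df = read_pdfs(spark, "path/to/pdfs", resolution="300")`.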
System Requirements:
- Java: Java 8, 11, or 17
- Apache Spark: 3.3.2, 3.4.1, 3.5.0, or 4.0.0 (Spark 4.0.0 is supported from version 0.1.11 onward and requires Java 17 and Scala 2.13)
- Ghostscript: 9.50 or later (required exclusively for the Ghostscript reader)
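The requirements above can be written down as a small pre-flight check. The following is a sketch only: the function name `check_environment` is hypothetical, and it assumes that any patch release within a listed Spark release line (for example 3.5.x) is acceptable, which the list above does not state explicitly.

```python
SUPPORTED_JAVA = {8, 11, 17}
# Assumption: compatibility is decided per release line, not per exact patch version.
SUPPORTED_SPARK_LINES = {"3.3", "3.4", "3.5", "4.0"}


def check_environment(java_major, spark_version, scala_version):
    """Return a list of problems; an empty list means the combination looks fine."""
    problems = []
    spark_line = ".".join(spark_version.split(".")[:2])

    if java_major not in SUPPORTED_JAVA:
        problems.append(f"Java {java_major} is not one of {sorted(SUPPORTED_JAVA)}")
    if spark_line not in SUPPORTED_SPARK_LINES:
        problems.append(f"Spark {spark_version} is not a listed release line")

    # Spark 4.0 carries two extra constraints.
    if spark_line == "4.0":
        if java_major != 17:
            problems.append("Spark 4.0 requires Java 17")
        if scala_version != "2.13":
            problems.append("Spark 4.0 requires Scala 2.13")
    return problems
```

For example, `check_environment(11, "4.0.0", "2.12")` reports both the Java and the Scala mismatch, while `check_environment(17, "4.0.0", "2.13")` returns an empty list.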
Installation:
The binary package is available from the Maven Central Repository. Use the Maven coordinates that match your Spark version:
- Spark 3.5.*:
com.stabrise:spark-pdf-spark35_2.12:0.1.11
- Spark 3.4.*:
com.stabrise:spark-pdf-spark34_2.12:0.1.11
- Spark 3.3.*:
com.stabrise:spark-pdf-spark33_2.12:0.1.11
- Spark 4.0.*:
com.stabrise:spark-pdf-spark40_2.13:0.1.11
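One way to avoid copying the wrong coordinate is to derive it from the Spark version. The sketch below covers only the Scala 2.12 artifacts for Spark 3.3 to 3.5 and builds a `spark-submit` argument list using the standard `--packages` flag. The helper names `maven_coordinate` and `spark_submit_args` are hypothetical; extend the table if you target another release line.

```python
SPARK_PDF_VERSION = "0.1.11"

# Artifact name fragment per Spark 3.x release line (all built for Scala 2.12).
ARTIFACTS_3X = {"3.3": "spark33", "3.4": "spark34", "3.5": "spark35"}


def maven_coordinate(spark_version):
    """Return the Maven coordinate for a Spark 3.3, 3.4, or 3.5 version string."""
    line = ".".join(spark_version.split(".")[:2])
    if line not in ARTIFACTS_3X:
        raise ValueError(f"this sketch only covers Spark 3.3 to 3.5, got {spark_version}")
    return f"com.stabrise:spark-pdf-{ARTIFACTS_3X[line]}_2.12:{SPARK_PDF_VERSION}"


def spark_submit_args(spark_version, app_path):
    """Build the argument list for launching `app_path` with the package attached."""
    return ["spark-submit", "--packages", maven_coordinate(spark_version), app_path]
```

The same coordinate can instead be supplied through the `spark.jars.packages` configuration property when a session is created programmatically.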
Contributing to the Project:
We welcome community involvement. If you find Spark PDF useful, please consider starring the project repository on GitHub.
Conclusion:
Spark PDF gives Apache Spark a robust and efficient way to bring PDF data into data pipelines, so data scientists, engineers, and analysts can analyze content that was previously locked away in PDF files.