Introducing Spark PDF: A Powerful Data Source for Apache Spark

Dec. 27, 2024 mykola

This blog post introduces Spark PDF, a custom data source for Apache Spark that lets you read PDF documents directly into your Spark workflows.

Key Capabilities:

  • Effortless PDF Ingestion: Read PDF documents directly into Spark DataFrames for efficient data processing and analysis.
  • Optimized Performance: Pages are read lazily, one at a time, which keeps memory consumption low even for large-scale datasets.
  • Support for Large Documents: Handle PDF files with extensive page counts (up to 10,000 pages).
  • Enhanced OCR Functionality: Extract text from scanned PDF documents with built-in Optical Character Recognition (OCR) capabilities.
  • Simplified Setup: No separate Tesseract OCR installation is needed, because Tesseract OCR is included within the package.
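
To make the ingestion step concrete, here is a minimal PySpark sketch of reading PDFs through the data source. It is written under assumptions that this post does not state: the format name "pdf" and the option names reader, resolution, and pagePerPartition (and the value "pdfBox") are illustrative and should be checked against the project documentation. The helper names pdf_reader_options and read_pdfs are hypothetical.

```python
def pdf_reader_options(reader="pdfBox", resolution=200, pages_per_partition=2):
    """Build string-valued options for the PDF data source.

    ASSUMPTION: the option names and the "pdfBox" value are not given in
    this post; verify them against the project documentation.
    """
    return {
        "reader": reader,
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
    }


def read_pdfs(spark, path, **kwargs):
    """Load the PDF file(s) at `path` into a Spark DataFrame.

    `spark` is an existing SparkSession that already has the Spark PDF
    package on its classpath. ASSUMPTION: the source registers itself
    under the short format name "pdf".
    """
    options = pdf_reader_options(**kwargs)
    # DataFrameReader.format / options / load are standard PySpark calls.
    return spark.read.format("pdf").options(**options).load(path)
```

With a live session, something like read_pdfs(spark, "/data/reports/*.pdf").printSchema() would show which columns the source actually exposes, which is a safer starting point than guessing column names.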

System Requirements:

  • Java: Java 8, 11, or 17
  • Apache Spark: 3.3.2, 3.4.1, 3.5.0, or 4.0.0 (Spark 4.0.0 is supported from version 0.1.11 onward and requires Java 17 and Scala 2.13)
  • Ghostscript: 9.50 or later (required only when using the Ghostscript reader)

Installation:

The binary package is available from the Maven Central Repository. Use the Maven coordinates that match your Spark version:

  • Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
  • Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
  • Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
  • Spark 4.0.*: com.stabrise:spark-pdf-spark40_2.13:0.1.11
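
As a rough illustration of how these coordinates can be used from PySpark, the sketch below picks the coordinate for a given Spark 3.x version and passes it to the standard spark.jars.packages setting, so Spark resolves the package from Maven Central at startup. The helper names spark_pdf_coordinate and build_session and the application name are hypothetical, and the lookup table covers only the Scala 2.12 builds for Spark 3.x listed above.

```python
import re

# Artifact names for the Spark 3.x (Scala 2.12) builds, copied from the
# list above. Other Spark lines are deliberately left out of this sketch.
ARTIFACTS = {
    "3.5": "spark-pdf-spark35_2.12",
    "3.4": "spark-pdf-spark34_2.12",
    "3.3": "spark-pdf-spark33_2.12",
}


def spark_pdf_coordinate(spark_version, package_version="0.1.11"):
    """Return the Maven coordinate matching a Spark version such as "3.5.0"."""
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {spark_version!r}")
    minor = f"{match.group(1)}.{match.group(2)}"
    if minor not in ARTIFACTS:
        raise ValueError(f"No artifact listed in this sketch for Spark {minor}")
    return f"com.stabrise:{ARTIFACTS[minor]}:{package_version}"


def build_session(spark_version):
    """Create a SparkSession that downloads Spark PDF at startup."""
    # Imported lazily so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-pdf-demo")  # hypothetical application name
        # spark.jars.packages takes comma-separated Maven coordinates.
        .config("spark.jars.packages", spark_pdf_coordinate(spark_version))
        .getOrCreate()
    )
```

Calling build_session("3.5.0") would request com.stabrise:spark-pdf-spark35_2.12:0.1.11. The same coordinate can also be passed on the command line through the --packages flag of spark-submit or pyspark.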

Contributing to the Project:

We encourage community involvement. If you find Spark PDF valuable, please consider showing your support by starring the project repository on GitHub.

Conclusion:

Spark PDF significantly enhances the capabilities of Apache Spark by providing a robust and efficient mechanism for integrating PDF data into your data pipelines. This empowers data scientists, engineers, and analysts to unlock valuable insights from previously inaccessible PDF sources.