An Open-Source Data Source for Reading PDF Files in Apache Spark


The project provides a custom data source for Apache Spark that allows you to read PDF files into a Spark DataFrame.

If you find this project useful, please give the repository a star.

Key features:

✅ Read PDF documents into a Spark DataFrame

✅ Works with Scala and Python (PySpark)

✅ Reads PDF files lazily, page by page

✅ Supports large files, up to 10,000 pages

✅ Supports scanned PDF files (OCR)

✅ No need to install Tesseract OCR: it is included in the package

Requirements

Spark 4.0.0 is supported in version 0.1.11 and later (requires Java 17 and Scala 2.13).

Installation

The binary package is available in the Maven Central Repository.
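As a minimal sketch, the package can be pulled in at launch time via Spark's `--packages` flag, using the coordinates from the examples below (adjust the Scala suffix and version to your setup; `your_app.py` is a placeholder for your application):

```shell
# Resolve com.stabrise:spark-pdf from Maven Central when submitting the job.
spark-submit \
  --packages com.stabrise:spark-pdf_2.12:0.1.7 \
  your_app.py
```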

Options for the data source (as used in the examples below):

- `imageType`: image format for rendered pages (for example, `BINARY`)
- `resolution`: rendering resolution in DPI (for example, `200`)
- `pagePerPartition`: number of pages per Spark partition (for example, `2`)
- `reader`: PDF reading backend (for example, `pdfBox`)
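The `pagePerPartition` option controls how pages are spread across partitions. A rough back-of-the-envelope sketch, assuming each partition holds at most that many pages (an assumed model, not a documented formula):

```python
import math

def num_partitions(total_pages: int, pages_per_partition: int) -> int:
    """Estimate the partition count if each partition holds up to
    `pages_per_partition` pages (an assumed model of pagePerPartition)."""
    return math.ceil(total_pages / pages_per_partition)

# A 10,000-page file with pagePerPartition=2 would map to 5,000 partitions.
print(num_partitions(10_000, 2))
```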

Output columns in the DataFrame:

The DataFrame contains several columns, including `path` and `document` (selected in the examples below).

Examples of usage

Scala

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7")
  .getOrCreate()

val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .load("path to the pdf file(s)")

df.select("path", "document").show()
```

Python

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()
```