Spark PDF with Spark Connect

March 6, 2025 mykola
Spark PDF DataSource with Spark Connect

Spark Connect

The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark’s DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.

More details about Spark Connect can be found in the official documentation.

There are two ways to use Spark PDF with Spark Connect:

  • load Spark PDF jar on the server side
  • add Spark PDF jar from the client side

Start Spark Connect with Spark PDF

The first way is to add the Spark PDF package to the Spark session when starting the Spark Connect server:

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4,\
 com.stabrise:spark-pdf-spark35_2.12:0.1.15
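The `--packages` flag resolves the jars from Maven Central at startup. As a sketch of an alternative (assuming the jar has already been downloaded to a local path of your choosing), the standard `--jars` option can be used instead:

```shell
# Sketch: pass a pre-downloaded Spark PDF jar directly with --jars
# instead of resolving it from Maven Central via --packages.
# The local path below is a placeholder for wherever you saved the jar.
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.4 \
  --jars /path/to/spark-pdf-spark35_2.12-0.1.15.jar
```

This avoids a network fetch on server start, which can be useful in air-gapped environments.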

To use the Spark PDF datasource with a remote cluster, start a Spark Connect session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
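Instead of hard-coding the remote URL in the builder, the connection string can also be supplied through the `SPARK_REMOTE` environment variable. A minimal sketch, assuming the server is on localhost with the default Spark Connect port 15002:

```shell
# Point PySpark clients at the Spark Connect server via an environment
# variable; SparkSession.builder.getOrCreate() will then connect remotely
# without an explicit .remote(...) call.
export SPARK_REMOTE="sc://localhost:15002"
```

This keeps the connection details out of application code, so the same script can run against different clusters.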

Then simply read PDF files from any supported source:

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load("s3a://[bucket_name]/*.pdf")

Add Spark PDF package to the Spark Connect Session

Download the latest version of the Spark PDF jar file from Maven Central:

wget https://repo1.maven.org/maven2/com/stabrise/spark-pdf-spark35_2.12/0.1.15/spark-pdf-spark35_2.12-0.1.15.jar
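Maven Central publishes a SHA-1 checksum alongside each artifact, so the download can optionally be verified before adding it to the session. A sketch:

```shell
# Optional: fetch the published SHA-1 checksum and compare it against
# the locally computed digest of the downloaded jar.
wget https://repo1.maven.org/maven2/com/stabrise/spark-pdf-spark35_2.12/0.1.15/spark-pdf-spark35_2.12-0.1.15.jar.sha1
sha1sum spark-pdf-spark35_2.12-0.1.15.jar
cat spark-pdf-spark35_2.12-0.1.15.jar.sha1
```

The two hashes should match; if not, the download is corrupt and should be repeated.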

Then add it to the session as an artifact:

spark.addArtifact("spark-pdf-spark35_2.12-0.1.15.jar")

After that, the PDF datasource will be available for reading files.

Links