Spark PDF with Spark Connect
Mykola Melnyk

Machine Learning & Data Processing Expert

Spark Connect

The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark’s DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.

You can read more about Spark Connect in the official documentation.

There are two ways to use Spark PDF with Spark Connect:

  • Load the Spark PDF jar on the server side
  • Add the Spark PDF jar from the client side

Start Spark Connect with Spark PDF

The first way is to add the Spark PDF package to the Spark Session when starting the Spark Connect server:

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.4,com.stabrise:spark-pdf-spark35_2.12:0.1.15

To use the Spark PDF Datasource with a remote cluster, start a Spark Connect Session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

Next, simply read PDF files from any supported source:

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load("s3a://[bucket_name]/*.pdf")

Add Spark PDF package to the Spark Connect Session

Download the latest version of the Spark PDF jar from Maven Central:

wget https://repo1.maven.org/maven2/com/stabrise/spark-pdf-spark35_2.12/0.1.15/spark-pdf-spark35_2.12-0.1.15.jar

Then add it as an artifact:

spark.addArtifact("spark-pdf-spark35_2.12-0.1.15.jar")

After this, the PDF datasource is available for reading files in the current session.
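Putting the client-side steps together, a session might look like the sketch below. This assumes a Spark Connect server is already running at `sc://localhost` and the jar downloaded above sits in the current directory; the load path is a placeholder, and the option values simply mirror the earlier server-side example.

```python
from pyspark.sql import SparkSession

# Connect to a running Spark Connect server (adjust the host/port as needed).
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Ship the Spark PDF jar to the server; it is registered for this session only.
spark.addArtifact("spark-pdf-spark35_2.12-0.1.15.jar")

# Read PDFs with the now-available "pdf" datasource.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load("/path/to/pdfs/*.pdf")

# Inspect the columns the datasource exposes for this version of the jar.
df.printSchema()
```

Note that `addArtifact` is scoped to the session: other clients connected to the same server do not see the jar, which makes this approach convenient for trying different Spark PDF versions without restarting the server.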

Links