Spark PDF on Databricks

Great news: the Spark PDF Data Source can now run on Databricks. It was a long story, and it took me about a month to get it working.

The main issue was a difference in some classes between community Apache Spark and Databricks Safe Spark.

I filed an issue in the Apache Spark Jira. The difference was in a single class, and it took a month to fix.

Thanks to Martin Grund and Alex Otter from Databricks. They helped update the class in Databricks Safe Spark so that the PDF Data Source can run in the Databricks environment.

So let's do it step by step now.

Create Cluster

I tested on Databricks runtime 15.4 and 16.0.

Let's create a small single-node cluster using the 15.4 runtime.

Install library

You need to install Spark PDF on the cluster manually.

Maven coordinates: com.stabrise:spark-pdf-spark35_2.12:0.1.16
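
If you would rather script this step than click through the cluster UI, the same Maven coordinates can be sent to the Databricks Libraries API. The sketch below only builds the request body for the `POST /api/2.0/libraries/install` endpoint. The helper name `build_install_payload` and the cluster ID are made up for illustration, and actually sending the request (workspace host, token) is left out.

```python
import json

# Maven coordinates from this post, in group:artifact:version form.
MAVEN_COORDINATES = "com.stabrise:spark-pdf-spark35_2.12:0.1.16"


def build_install_payload(cluster_id: str, coordinates: str = MAVEN_COORDINATES) -> dict:
    """Build the JSON body for a Libraries API install request (hypothetical helper)."""
    # Unpacking raises ValueError early if the coordinates are not group:artifact:version.
    group, artifact, version = coordinates.split(":")
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": f"{group}:{artifact}:{version}"}}],
    }


# "0123-456789-abcdefgh" is a placeholder cluster ID; replace it with your own.
payload = build_install_payload("0123-456789-abcdefgh")
print(json.dumps(payload, indent=2))
```

The `spark35_2.12` suffix in the artifact name appears to target Spark 3.5 with Scala 2.12, which should match the 15.4 runtime used here.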

Load example notebook

I prepared a notebook adapted for and tested on Databricks.

Run example

In this notebook we download a few example PDF files from GitHub and store them in the workspace:

Python

import urllib.request

filenames = ["./example1.pdf", "./example2.pdf", "./example3.pdf"]
url = "https://raw.githubusercontent.com/StabRise/spark-pdf/refs/heads/main/examples/"

# Download each example PDF into the current working directory
for f in filenames:
    urllib.request.urlretrieve(url + f.split("/")[-1], f)
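
If you re-run the notebook often, you may not want to download the same files every time. Here is a minimal sketch of one way to do that, assuming the same base URL and file names as above. The helpers `plan_downloads` and `download_missing` are hypothetical names, not part of the example notebook.

```python
import os
import urllib.request

BASE_URL = "https://raw.githubusercontent.com/StabRise/spark-pdf/refs/heads/main/examples/"


def plan_downloads(filenames, base_url=BASE_URL):
    """Pair each remote URL with the local path it should be saved to."""
    return [(base_url + os.path.basename(f), f) for f in filenames]


def download_missing(filenames, base_url=BASE_URL):
    """Fetch only the files that are not on disk yet; return the paths that were fetched."""
    fetched = []
    for remote, local in plan_downloads(filenames, base_url):
        if not os.path.exists(local):
            urllib.request.urlretrieve(remote, local)
            fetched.append(local)
    return fetched


filenames = ["./example1.pdf", "./example2.pdf", "./example3.pdf"]

# Show what would be downloaded, without touching the network.
for remote, local in plan_downloads(filenames):
    print(local, "<-", remote)
```

Calling `download_missing(filenames)` then does the actual download and skips anything already present.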

And read them using the PDF DataSource:

Python

import os

# Read the downloaded PDFs from the current working directory with the PDF data source
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load([f"file:{os.getcwd()}/{f}" for f in filenames])
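
A note on the paths: the `file:` prefix is there so that Spark looks on the driver's local filesystem, where the files were downloaded, instead of the default DBFS location. If you prefer not to build these strings by hand, the sketch below does the same with `pathlib`. The helper `to_file_uri` is a hypothetical name, and I am assuming the files sit in the current working directory as in the download step.

```python
from pathlib import Path
from typing import Optional


def to_file_uri(filename: str, base_dir: Optional[str] = None) -> str:
    """Turn a path relative to base_dir (default: cwd) into an absolute file: URI."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # pathlib collapses the leading "./"; as_uri() adds the file:// scheme
    # and requires the resulting path to be absolute.
    return (base / filename).as_uri()


filenames = ["./example1.pdf", "./example2.pdf", "./example3.pdf"]
paths = [to_file_uri(f) for f in filenames]
print(paths)
```

The resulting `paths` list can be passed to `.load(paths)` in place of the list comprehension above.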

And here is the output of reading the files:
