Read PDF files from Databricks Unity Catalog volumes using the Spark PDF Data Source

We improved Databricks support in the Spark PDF Data Source by adding support for Unity Catalog.

You can now read PDF files from Unity Catalog volumes using the Spark PDF Data Source.

Create Cluster

Spark PDF supports Databricks runtime 15.4 and above.

Install library

You need to install the Spark PDF library on the cluster manually.

Maven coordinates: com.stabrise:spark-pdf-spark35_2.12:0.1.16
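
If you prefer to script the installation instead of using the cluster UI, the Databricks Libraries API accepts a JSON body that lists Maven coordinates. The sketch below only builds that body. The helper name `maven_install_payload` and the cluster ID placeholder are hypothetical, and sending the request (workspace URL, authentication) is intentionally left out.

```python
import json


def maven_install_payload(cluster_id: str, coordinates: str) -> dict:
    """Build a request body that asks a cluster to install one Maven library."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }


# "<your-cluster-id>" is a placeholder; use the ID of your own cluster.
payload = maven_install_payload(
    "<your-cluster-id>",
    "com.stabrise:spark-pdf-spark35_2.12:0.1.16",
)
print(json.dumps(payload, indent=2))
```

Installing through the cluster's Libraries tab with the same Maven coordinates gives the same result.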

Upload example files

Upload the example files to a Databricks Unity Catalog volume.

Read the PDF files using PDF DataSource

You can use either the Scala or the Python (PySpark) API to read PDF files from a Unity Catalog volume.

Specify your catalog and volume names in the code below:

Python

# Replace {CATALOG_NAME} and {VOLUME_NAME} with your own catalog and volume names.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load("/Volumes/{CATALOG_NAME}/default/{VOLUME_NAME}/*.pdf")

df.show()

Example output:

You can find the full example in the notebook.
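
The load path used above follows the Unity Catalog volume layout `/Volumes/<catalog>/<schema>/<volume>/<path>`, with `default` as the schema. As a minimal sketch, a small helper can build that path from the individual names so they are set in one place. The function name `volume_pdf_glob` and the sample names `main` and `pdf_files` are hypothetical.

```python
def volume_pdf_glob(catalog: str, volume: str,
                    schema: str = "default", pattern: str = "*.pdf") -> str:
    """Build a glob path for PDF files stored in a Unity Catalog volume."""
    for label, value in (("catalog", catalog), ("schema", schema), ("volume", volume)):
        # Each name is a single path segment, so it must be non-empty and contain no slash.
        if not value or "/" in value:
            raise ValueError(f"invalid {label} name: {value!r}")
    return f"/Volumes/{catalog}/{schema}/{volume}/{pattern}"


# "main" and "pdf_files" are illustrative names only.
path = volume_pdf_glob("main", "pdf_files")
print(path)  # /Volumes/main/default/pdf_files/*.pdf
```

The resulting string can be passed to `.load(...)` in place of the hard-coded path.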
