Spark PDF with Spark Connect

Spark Connect
The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark’s DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
More details about Spark Connect are available in the official documentation.
There are two ways to use Spark PDF with Spark Connect:
- load Spark PDF jar on the server side
- add Spark PDF jar from the client side
Start Spark Connect with Spark PDF
The first way is to add the Spark PDF package to the Spark session when starting the Spark Connect server:

./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.4,com.stabrise:spark-pdf-spark35_2.12:0.1.15
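When the server is started from automation rather than by hand, the `--packages` value can be assembled programmatically. The sketch below is a hypothetical helper (the function name and structure are not part of Spark); only the two Maven coordinates come from the command above.

```python
# Hypothetical helper: build the comma-separated --packages value and the
# full start-connect-server.sh invocation as an argument list.
def packages_option(coordinates):
    """Join Maven coordinates into the comma-separated --packages value."""
    return ",".join(coordinates)

# Coordinates taken from the command above.
cmd = [
    "./sbin/start-connect-server.sh",
    "--packages",
    packages_option([
        "org.apache.spark:spark-connect_2.12:3.5.4",
        "com.stabrise:spark-pdf-spark35_2.12:0.1.15",
    ]),
]
```

The argument-list form can be passed directly to `subprocess.run(cmd)` without shell quoting concerns.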
To use the Spark PDF datasource with a remote cluster, start a Spark Connect session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
Then simply read PDF files from any supported source:

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load("s3a://[bucket_name]/*.pdf")
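The same read can be expressed with an options dict, which makes the settings reusable across jobs. The `read_pdfs` wrapper below is a hypothetical sketch, not part of the Spark PDF API; `.options(**opts)` is standard PySpark.

```python
# Options from the example above, kept in one place for reuse.
PDF_OPTIONS = {
    "imageType": "BINARY",    # render pages as binary images
    "resolution": "300",      # DPI used when rasterizing pages
    "pagePerPartition": "8",  # pages per Spark partition
    "reader": "pdfBox",       # underlying PDF parsing library
}

def read_pdfs(spark, path, options=PDF_OPTIONS):
    """Hypothetical wrapper: read PDFs via the Spark PDF datasource."""
    return spark.read.format("pdf").options(**options).load(path)
```

A shared dict like this also gives one place to change settings (e.g. resolution) for every job that reads PDFs.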
Add Spark PDF package to the Spark Connect Session
Download the latest version of the Spark PDF jar file from Maven Central:
wget https://repo1.maven.org/maven2/com/stabrise/spark-pdf-spark35_2.12/0.1.15/spark-pdf-spark35_2.12-0.1.15.jar
Then add it as an artifact:
spark.addArtifact("spark-pdf-spark35_2.12-0.1.15.jar")
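The download-then-register steps above can be sketched in Python. The URL builder and `fetch_and_add` helper below are hypothetical (only `spark.addArtifact` and the Maven Central URL layout come from the steps above).

```python
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

def maven_jar_url(group, artifact, version):
    """Build the Maven Central jar URL (dots in the group become slashes)."""
    return (f"{MAVEN_CENTRAL}/{group.replace('.', '/')}/"
            f"{artifact}/{version}/{artifact}-{version}.jar")

def fetch_and_add(spark, group, artifact, version):
    """Hypothetical helper: download the jar, then register it with the session."""
    jar = f"{artifact}-{version}.jar"
    urllib.request.urlretrieve(maven_jar_url(group, artifact, version), jar)
    spark.addArtifact(jar)
```

For example, `fetch_and_add(spark, "com.stabrise", "spark-pdf-spark35_2.12", "0.1.15")` reproduces the wget and addArtifact steps above in one call.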
After that, the PDF datasource will be available for reading files.