Structured Data Extraction from PDFs with AI

Mykola Melnyk

Machine Learning & Data Processing Expert

Olga Druchek

Data / ML Engineer

In this post, we’ll show you how ScaleDP can elevate your Spark-PDF workflow by making document processing smarter and faster. We’ll specifically focus on how to get started with structured data extraction, which can save you hours of manual work.


What is ScaleDP?

ScaleDP is an open-source library that lets you process documents with AI/ML capabilities and scale that processing using Apache Spark.

How to Use ScaleDP for Efficient Structured Data Extraction from PDFs

While Spark-PDF reads PDFs into Spark DataFrames, it doesn't extract structured data. ScaleDP solves this by using pre-trained AI models to automatically extract key information from documents like invoices, contracts, or forms, freeing you to focus on higher-level tasks.

Using ScaleDP with Spark-PDF simplifies the process of extracting structured data from PDFs. Here’s how to get started:

1. Define Your Data Schema for Document Processing

The first step is to define the schema for the data you want to extract from the PDF. For example, if you’re processing invoices, your schema might include fields like hospital name, tax ID, items, and the total amount.

Here's an example of how you can define your schema for an invoice:

from pydantic import BaseModel

class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

Defining this structure allows ScaleDP's AI models to automatically identify the relevant fields in your document and extract the data accordingly.
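Because the schema is a plain Pydantic model, you can sanity-check it locally before wiring it into a Spark pipeline. Here is a minimal sketch (assuming Pydantic v2; the sample values mirror the invoice shown later in this post):

```python
from pydantic import BaseModel

class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

# Validate a sample payload against the schema.
sample = {
    "hospital": "Hope Haven Hospital",
    "tax_id": "26-123123",
    "address": "855 Howard Street",
    "email": "hopedutton@hopehaven.com",
    "phone": "(123) 456-1238",
    "items": [{"date": "10/21/2022", "item": "Appointment",
               "note": "October 2022", "debit": "1,056.25"}],
    "total": "1024.50",
}
invoice = InvoiceSchema.model_validate(sample)
print(invoice.hospital)        # Hope Haven Hospital
print(invoice.items[0].item)   # Appointment
```

A payload that is missing a field or has the wrong shape raises a ValidationError, which is exactly the kind of mistake you want to catch before running a large Spark job.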

2. Create the Spark Pipeline for AI-Powered Data Extraction

Next, set up the Spark pipeline to process the document using AI models. This pipeline defines how to extract data from the PDF and map it to the schema you've defined.

from pyspark.ml import PipelineModel
from scaledp import *  # ScaleDP exposes LLMVisualExtractor and other pipeline stages

pipeline = PipelineModel(stages=[
    LLMVisualExtractor(
        model="gemini-1.5-flash",
        apiKey="your_key",
        apiBase="https://generativelanguage.googleapis.com/v1beta/",
        schema=InvoiceSchema,
        outputCol="invoice"
    )
])

To connect to an AI model provider like Gemini or OpenAI, ensure that you have the correct API key and base URL. Once the connection is made, the pipeline will automatically extract the data from your PDFs.
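The apiKey value in the snippet above is a placeholder. Rather than hardcoding a real key, a common pattern is to read it from the environment; a small sketch (the variable name GEMINI_API_KEY is an assumption for this example, not a ScaleDP convention):

```python
import os

def require_key(name: str) -> str:
    """Return the named environment variable, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before building the pipeline")
    return value

# Example: pass the result as apiKey=require_key("GEMINI_API_KEY")
```

Failing fast here is preferable to letting a long Spark job start and then error out on every unauthenticated API request.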

3. Process Your PDFs with ScaleDP and Spark-PDF

Once your pipeline is set up, run it over the DataFrame of PDFs loaded by Spark-PDF (here, df) to extract the data:

result = pipeline.transform(df).cache()

ScaleDP uses zero-shot AI models, which means you don’t have to manually tag or label data. Define the schema, and the AI will do the rest.

4. View Your Structured Data in Spark DataFrames or JSON Format

One of the key benefits of ScaleDP is how quickly you can get started: define your data structure, and the zero-shot models handle extraction and structuring without any manual labeling.

Once the document is processed, the extracted data is returned in a structured format, such as a Spark DataFrame or JSON, for easy analysis or integration with other systems.

Here’s an example of how the data might appear in a DataFrame:

result.select("invoice.data.*").show()

The output might look like this:

+-------------------+---------+-------------------+--------------------+--------------+--------------------+-------+
|           hospital|   tax_id|            address|               email|         phone|               items|  total|
+-------------------+---------+-------------------+--------------------+--------------+--------------------+-------+
|Hope Haven Hospital|26-123123|855 Howard Stree...|hopedutton@hopeha...|(123) 456-1238|[{10/21/2022, App...|1024.50|
+-------------------+---------+-------------------+--------------------+--------------+--------------------+-------+

Alternatively, you can display the data in JSON format:

result.show_json("invoice")

This will provide structured, easy-to-read data ready for further analysis:

{
    "hospital": "Hope Haven Hospital",
    "tax_id": "26-123123",
    "address": "855 Howard Street\nDutton, MI 49316",
    "email": "hopedutton@hopehaven.com",
    "phone": "(123) 456-1238",
    "items": [
        {
            "date": "10/21/2022",
            "item": "Appointment",
            "note": "October 2022",
            "debit": "1,056.25"
        },
        {
            "date": "10/21/2022",
            "item": "Insurance",
            "note": "October 2022",
            "debit": "($105.63)"
        },
        {
            "date": "10/21/2022",
            "item": "Medical Record Request",
            "note": "October 2022",
            "debit": "73.87"
        },
        {
            "date": "10/21/2022",
            "item": "Insurance",
            "note": "October 2022",
            "debit": "00.00"
        }
    ],
    "total": "1024.50"
}
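Note that the extracted debit strings come back in mixed formats: thousands separators ("1,056.25") and accounting-style negatives ("($105.63)"). Before doing any arithmetic on them, you would normalize the values; a small hypothetical helper (not part of ScaleDP) might look like this:

```python
def parse_debit(raw: str) -> float:
    """Convert strings like '1,056.25' or '($105.63)' to floats."""
    s = raw.strip().replace(",", "").replace("$", "")
    if s.startswith("(") and s.endswith(")"):  # accounting-style negative
        return -float(s[1:-1])
    return float(s)

debits = ["1,056.25", "($105.63)", "73.87", "00.00"]
print([parse_debit(d) for d in debits])  # [1056.25, -105.63, 73.87, 0.0]
```

In a Spark workflow this function could be wrapped in a UDF and applied to the items column after extraction.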

Conclusion

Using ScaleDP and Spark-PDF together lets you extract structured data from PDFs with minimal effort. The key steps are defining your data schema, setting up the Spark pipeline, processing the documents, and viewing the extracted data. Because no manual data tagging is required, the approach suits a wide range of document types.

Because ScaleDP is built on top of Apache Spark, it’s also great for processing large datasets. Whether you’re dealing with hundreds or thousands of PDFs, ScaleDP can process them in parallel, making your workflow faster and more efficient.

