An Open-Source Library for Processing Documents using AI/ML in Apache Spark.


Generative AI capabilities

LLM-based OCR

  • Accurate
  • Multilingual

LLM Data Extraction

  • Zero-shot data extraction
  • Declarative (the output schema can be defined up front)
  • Scalable
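
To make "declarative" concrete, here is a minimal sketch of schema-driven extraction in plain Python. It is not this library's API: the `Invoice` schema, its field names, and the helpers `schema_prompt` and `parse_response` are hypothetical, and the model reply is a hard-coded string rather than a real LLM call. The idea is that the schema alone drives both the instruction sent to the model and the validation of what comes back.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    """Hypothetical target schema; the field names are illustrative only."""
    invoice_number: str
    total: float
    currency: str


def schema_prompt(schema) -> str:
    """Render the schema's fields as an instruction for a model."""
    lines = [f"- {f.name}: {f.type.__name__}" for f in fields(schema)]
    return "Return a JSON object with exactly these fields:\n" + "\n".join(lines)


def parse_response(schema, raw: str):
    """Check a JSON reply against the schema and build a typed instance."""
    data = json.loads(raw)
    kwargs = {}
    for f in fields(schema):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        # Coerce each value to the type declared in the schema.
        kwargs[f.name] = f.type(data[f.name])
    return schema(**kwargs)


# Stand-in for a model reply; a real pipeline would obtain this from an LLM.
reply = '{"invoice_number": "A-17", "total": "42.5", "currency": "EUR"}'
invoice = parse_response(Invoice, reply)
print(schema_prompt(Invoice))
print(invoice)
```

Because nothing in the helpers is specific to invoices, extracting a different document type only requires declaring a different schema.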

Visual LLM data extraction

  • Extracts data directly from images
  • Zero-shot
  • Accurate
  • Declarative

LLM NER

  • Zero-shot NER

Supported OCR engines

Tesseract OCR

  • Fast
  • Most popular

Easy OCR

Ready-to-use OCR with support for 80+ languages.

Surya OCR

  • OCR in 90+ languages that benchmarks favorably vs cloud services
  • Line-level text detection in any language

DocTR OCR

Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2 & PyTorch.