Pdf Json Extractor

Name: Pdf Json Extractor
Author: p6t_p10n

About Pdf Json Extractor

Convert any PDF into structured JSON using AI and OCR (Tesseract or Google Vision). Supports custom schemas, validation, and auto-repair. Ideal for invoices, contracts, receipts, and automation workflows. Fast, accurate, and easy to integrate.

What does this actor do?

Pdf Json Extractor is an Actor on the Apify platform that converts PDF files into structured JSON. It combines PDF text parsing, optional OCR, and LLM-based schema extraction, and it runs in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

# PDF → Structured JSON Extractor (Apify Actor)

This Apify Actor extracts structured JSON from PDF files using PDF parsing + optional OCR + LLM-based schema extraction.

## Features

- Accepts a `pdfUrl` (HTTP) or `pdfBase64` (string) as input
- Extracts raw text using `pdf-parse` and optionally OCR (stub)
- Sends the text and a user-provided `schema` to an LLM to return strict JSON
- Pushes extraction result to Dataset

## Quick start

1. Update `main.js`'s `callLLM` function to call your chosen LLM provider (OpenAI, Anthropic, Google)
2. (Optional) Implement `runOCR` using Tesseract or a cloud OCR API
3. `apify push` to your Apify account and run the actor with `input.json`

## Example input.json

```json
{
  "pdfUrl": "https://example.com/invoice123.pdf",
  "schema": {
    "invoice_number": "string",
    "invoice_date": "date",
    "total_amount": "number",
    "items": [{ "name": "string", "qty": "number", "price": "number" }]
  },
  "aiModel": "gpt-4o-mini",
  "ocr": false,
  "returnFormat": "json"
}
```

## Notes

- The starter `callLLM` function is a stub for testing and must be replaced with an actual LLM API call before production use.
- Consider rate limits and cost of LLM calls. Offer batching or model selection in your product.

## Suggested pricing

- Free: 20 PDFs / month
- Starter: $19 / month (200 PDFs)
- Pro: $49 / month (1000 PDFs)
- Business: $149 / month (10k PDFs)

## Validation & LLM retry behavior

This Actor now validates the extracted JSON using `ajv` when you provide a JSON Schema as the `schema` input. If the JSON does not validate, the Actor will automatically attempt to repair it by sending a targeted prompt to the LLM (up to 2 repair attempts).

LLM calls use `p-retry` with exponential backoff for transient failures (retries on 5xx and rate-limit responses). You can control retry counts and model via the input parameters.

## OCR Options (Tesseract or Google Vision)

This Actor supports optional OCR when `ocr` is enabled in the input. You can select the OCR engine via the input `ocrOptions.engine` field.

### `ocrOptions` example

```json
"ocr": true,
"ocrOptions": { "engine": "tesseract" }
```

or for Google Vision:

```json
"ocr": true,
"ocrOptions": { "engine": "google" }
```

### Tesseract (offline)

- Uses `tesseract.js` (Node). This allows OCR without external APIs but adds a larger dependency.
- No env vars needed. Install dependencies and run the Actor as usual.

### Google Vision (cloud OCR)

- Uses Google Vision `DOCUMENT_TEXT_DETECTION` endpoint. Requires `GOOGLE_API_KEY` env var with an API key that has Vision API enabled.
- Set the key in environment before running:

```bash
export GOOGLE_API_KEY="YOUR_GOOGLE_VISION_API_KEY"
```

### Behavior notes

- The Actor will attempt `pdf-parse` extraction first. If `ocr` is true and extracted text is short or empty, the configured OCR engine will be invoked.
- OCR can be slower and more expensive (Google Vision costs), so use it only for scanned PDFs.
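
### Sketch: text-first extraction with OCR fallback

The fallback order in the behavior notes could look roughly like the sketch below. It is an illustration under stated assumptions, not the Actor's actual code: the README does not define "short", so the 50-character `MIN_TEXT_CHARS` threshold is hypothetical, as are the helper names, the default of `tesseract` when no engine is given, and the injected `parsePdf` and `ocrEngines` callbacks that stand in for `pdf-parse` and the real OCR engines.

```javascript
// Hypothetical threshold: the README only says "short or empty".
const MIN_TEXT_CHARS = 50;

// Decide whether OCR should run after the pdf-parse pass.
function shouldRunOcr(extractedText, ocrEnabled, minChars = MIN_TEXT_CHARS) {
  if (!ocrEnabled) return false;
  const text = (extractedText || '').trim();
  return text.length < minChars;
}

// Read the engine from `ocrOptions.engine`.
// Defaulting to "tesseract" is an assumption, not documented behavior.
function resolveOcrEngine(input) {
  const engine = (input.ocrOptions && input.ocrOptions.engine) || 'tesseract';
  if (engine !== 'tesseract' && engine !== 'google') {
    throw new Error(`Unsupported OCR engine: ${engine}`);
  }
  return engine;
}

// Try embedded text first, then fall back to the configured OCR engine.
// `parsePdf` and `ocrEngines` are injected stand-ins for the real libraries.
async function extractText(pdfBuffer, input, parsePdf, ocrEngines) {
  const parsed = await parsePdf(pdfBuffer);
  if (!shouldRunOcr(parsed, input.ocr === true)) {
    return { text: parsed, source: 'pdf-parse' };
  }
  const engine = resolveOcrEngine(input);
  const text = await ocrEngines[engine](pdfBuffer);
  return { text, source: engine };
}
```

Keeping the decision in a small pure function such as `shouldRunOcr` makes the threshold easy to tune and test without touching any OCR service.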
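
### Sketch: retry with exponential backoff

The "Validation & LLM retry behavior" section says LLM calls are retried with exponential backoff on 5xx and rate-limit responses. The Actor uses `p-retry` for this; the dependency-free sketch below only illustrates the idea. The function names, the default delays, and the assumption that a failed call throws an error carrying an HTTP `status` property are all hypothetical.

```javascript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Treat rate limits (429) and server errors (5xx) as transient.
// Assumes the thrown error exposes the HTTP status as `err.status`.
function isTransient(err) {
  const status = err && err.status;
  return status === 429 || (status >= 500 && status <= 599);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `fn` up to `retries + 1` times, waiting longer after each failure.
// Non-transient errors are rethrown immediately.
async function withRetry(fn, { retries = 3, baseMs = 1000, maxMs = 30000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (!isTransient(err) || attempt > retries) throw err;
      await sleep(backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
}
```

With the defaults above, the waits between attempts are 1 s, 2 s, and 4 s. Client errors such as a 400 are not retried, since repeating the same request would not help.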
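
### Sketch: validate-then-repair loop

The repair behavior from the "Validation & LLM retry behavior" section (validate the output, then send a targeted prompt for up to 2 repair attempts) might be structured as below. This is a sketch, not the Actor's source. The `askModel` callback stands in for `callLLM`, the repair prompt wording is invented, and `validate` is a plain function returning `{ valid, errors }`; in the Actor that role is played by an `ajv`-compiled schema.

```javascript
const MAX_REPAIR_ATTEMPTS = 2; // "up to 2 repair attempts" per the README

// Parse model output without throwing on malformed JSON.
function tryParseJson(raw) {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch (err) {
    return { ok: false, value: undefined };
  }
}

// Build a targeted repair prompt (the wording is hypothetical).
function buildRepairPrompt(raw, errors) {
  return [
    'The JSON below failed validation. Return corrected JSON only.',
    `Errors: ${JSON.stringify(errors)}`,
    `JSON: ${raw}`,
  ].join('\n');
}

// Ask the model once, then validate; on failure, ask for a repair and
// validate again, up to MAX_REPAIR_ATTEMPTS times.
async function extractWithRepair(prompt, askModel, validate) {
  let raw = await askModel(prompt);
  for (let attempt = 0; attempt <= MAX_REPAIR_ATTEMPTS; attempt++) {
    const parsed = tryParseJson(raw);
    const result = parsed.ok
      ? validate(parsed.value)
      : { valid: false, errors: ['output is not valid JSON'] };
    if (result.valid) return { data: parsed.value, repairs: attempt };
    if (attempt === MAX_REPAIR_ATTEMPTS) break;
    raw = await askModel(buildRepairPrompt(raw, result.errors));
  }
  throw new Error(`Validation failed after ${MAX_REPAIR_ATTEMPTS} repair attempts`);
}
```

Passing the validation errors back in the repair prompt gives the model something concrete to fix, and the hard cap keeps LLM cost bounded when a document cannot satisfy the schema.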

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Actor Information

Developer: p6t_p10n
Pricing: Paid
Total Runs: 9
Active Users: 2
