Bulk Pdf To Json OCR

Name: Bulk Pdf To Json OCR
Author: gagandeo

Convert PDF invoices, menus, images with text and documents into structured JSON. Features hybrid Digital+OCR parsing and AI-powered data extraction.

8 runs

2 users

What does this actor do?

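Bulk Pdf To Json OCR is an Actor on the Apify platform that converts PDF files into structured JSON. It reads digital text directly, falls back to Tesseract OCR for scanned documents, and can optionally pass the extracted text to Google Gemini to produce structured fields for a chosen document type.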

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
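
The last two steps can also be done from code rather than the Console. The sketch below is a minimal illustration, not the author's tooling: it builds an input object from the fields named in the Documentation section (`startUrls`, `structureData`, `documentType`, `maxPages`) and, optionally, starts a run with the `apify-client` Python package. The helper names `build_run_input` and `run_actor` are hypothetical, and the Actor ID is left as a placeholder because the listing does not state it.

```python
from typing import Any, Optional, Sequence

# Document types listed in the Actor's README.
DOCUMENT_TYPES = {
    "general", "invoice", "receipt", "menu",
    "resume", "contract", "brochure", "specification",
}


def build_run_input(
    pdf_urls: Sequence[str],
    structure_data: bool = False,      # README default: false
    document_type: Optional[str] = None,
    max_pages: int = 10,               # README default: 10
) -> dict[str, Any]:
    """Build an input object shaped like the README's example input."""
    if not pdf_urls:
        raise ValueError("startUrls is required and must not be empty")
    if document_type is not None and document_type not in DOCUMENT_TYPES:
        raise ValueError(f"unknown documentType: {document_type!r}")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")

    run_input: dict[str, Any] = {
        "startUrls": [{"url": url} for url in pdf_urls],
        "structureData": structure_data,
        "maxPages": max_pages,
    }
    # Only send documentType when the caller chose one; the README does not
    # state a default, so none is assumed here.
    if document_type is not None:
        run_input["documentType"] = document_type
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict]:
    """Start a run and return its dataset items (requires `pip install apify-client`)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not complete")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_actor(token, "<username>/<actor-name>", build_run_input(["https://example.com/document.pdf"], structure_data=True, document_type="invoice", max_pages=5))`, with the placeholder replaced by the Actor ID shown on its Apify page.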

Documentation

# PDF to JSON OCR Actor with Gemini AI

This Apify Actor converts PDF files to structured JSON data using intelligent text extraction and Google Gemini AI-powered structuring.

## Features

- 📄 Hybrid Text Extraction: Automatically detects digital text vs scanned images
- 🔍 OCR Support: Uses Tesseract OCR for scanned documents
- 🤖 AI Structuring: Powered by Google Gemini 2.0 Flash for intelligent data extraction
- 📋 Document Types: Optimized for invoices, receipts, menus, resumes, contracts, brochures, and general documents
- ⚡ Bulk Processing: Process multiple PDFs in a single run

## Setup

### 1. Configure Environment Variables

Copy the example environment file and add your Gemini API key:

```bash
cp .env.example .env
```

Edit `.env` and add your API key:

```env
GEMINI_API_KEY=AIzaSy...your-actual-key-here
GEMINI_MODEL=gemini-2.0-flash-exp
```

Get your Gemini API key from: https://aistudio.google.com/apikey

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Run the Actor

```bash
apify run
```

## Deploy to Apify

```bash
apify login
apify push
```

## Input Configuration

### Required Fields

- PDF URLs (`startUrls`): Array of direct PDF file URLs to process

### Optional Fields

- Enable AI Structuring (`structureData`): Toggle AI-powered data extraction (default: `false`)
- Document Type (`documentType`): Context for AI extraction. One of `general`, `invoice`, `receipt`, `menu`, `resume`, `contract`, `brochure`, `specification`
- Max Pages (`maxPages`): Limit pages processed per PDF (default: `10`)

### Example Input

```json
{
  "startUrls": [
    { "url": "https://example.com/document.pdf" }
  ],
  "structureData": true,
  "documentType": "invoice",
  "maxPages": 5
}
```

## How It Works

1. Download: Fetches PDF from provided URL
2. Text Extraction:
   - First attempts digital text extraction (fast)
   - Falls back to OCR if document is scanned (character density < 50/page)
3. AI Structuring (optional):
   - Sends extracted text to Google Gemini AI
   - Returns structured JSON based on document type
4. Data Storage: Pushes results to Apify dataset

## Output Format

```json
{
  "url": "https://example.com/document.pdf",
  "status": "success",
  "document_type": "invoice",
  "ai_enabled": true,
  "ai_model": "gemini-2.0-flash-exp",
  "is_ocr_scanned": false,
  "page_count": 3,
  "raw_text_preview": "First 500 characters of extracted text...",
  "extracted_data": {
    "invoice_number": "INV-001",
    "date": "2025-12-17",
    "total": "$1,234.56"
  }
}
```

## Project Structure

```text
.actor/
├── actor.json           # Actor config: name, version, env vars, runtime settings
├── dataset_schema.json  # Structure and representation of data produced by an Actor
├── input_schema.json    # Input validation & Console form definition
└── output_schema.json   # Specifies where an Actor stores its output
src/
└── main.py              # Actor entry point with PDF processing logic
.env                     # Environment variables (API keys) - DO NOT COMMIT!
.env.example             # Template for environment variables
storage/                 # Local storage (mirrors Cloud during development)
├── datasets/            # Output items (JSON objects)
├── key_value_stores/    # Files, config, INPUT
└── request_queues/      # Pending crawl requests
Dockerfile               # Container image definition
requirements.txt         # Python dependencies
```

For more information, see the Actor definition documentation.

## Dependencies

- Apify SDK - Actor runtime framework
- pdfplumber - Digital PDF text extraction
- pdf2image - Converts PDF pages to images
- pytesseract - OCR text recognition
- httpx - Async HTTP client for downloading PDFs
- google-generativeai - Google Gemini API client
- python-dotenv - Environment variable management

## Environment Variables

The Actor uses environment variables for configuration. These can be set in the `.env` file for local development:

- `GEMINI_API_KEY` - Your Google Gemini API key (required for AI structuring)
- `GEMINI_MODEL` - Model to use (default: `gemini-2.0-flash-exp`)

For Apify Cloud deployment: Set these as environment variables in the Actor settings on the Apify Console.
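
As an illustration of how these two variables might be read at startup, here is a minimal sketch using only the standard library. It is not taken from the Actor's `src/main.py`: the names `GeminiConfig` and `load_gemini_config` are hypothetical, and the choice to fail fast when AI structuring is requested without a key is an assumption based on the README calling the key "required for AI structuring".

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default model name as documented in the README.
DEFAULT_GEMINI_MODEL = "gemini-2.0-flash-exp"


@dataclass(frozen=True)
class GeminiConfig:
    api_key: Optional[str]
    model: str


def load_gemini_config(
    structure_data: bool,
    env: Mapping[str, str] = os.environ,
) -> GeminiConfig:
    """Read GEMINI_API_KEY and GEMINI_MODEL from the given environment mapping."""
    # Treat an empty string the same as an unset variable.
    api_key = env.get("GEMINI_API_KEY") or None
    model = env.get("GEMINI_MODEL") or DEFAULT_GEMINI_MODEL

    # Assumption: the key only matters when AI structuring is enabled.
    if structure_data and api_key is None:
        raise RuntimeError("GEMINI_API_KEY must be set when structureData is true")
    return GeminiConfig(api_key=api_key, model=model)
```

Passing the environment as a parameter keeps the function easy to test. For local development, python-dotenv's `load_dotenv()` can populate `os.environ` from `.env` before this function is called.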
## Getting Started

For complete information see this article.

1. Copy `.env.example` to `.env` and add your Gemini API key
2. Install dependencies: `pip install -r requirements.txt`
3. Run the Actor: `apify run`

## Deploy to Apify

### Connect Git repository to Apify

If you've created a Git repository for the project, you can easily connect to Apify:

1. Go to Actor creation page
2. Click on Link Git Repository button

### Push project on your local machine to Apify

You can also deploy the project on your local machine to Apify without the need for the Git repository.

1. Log in to Apify. You will need to provide your Apify API Token to complete this action.

   ```bash
   apify login
   ```

2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.

   ```bash
   apify push
   ```

## Documentation reference

To learn more about Apify and Actors, take a look at the following resources:

- Apify SDK for JavaScript documentation
- Apify SDK for Python documentation
- Apify Platform documentation
- Join our developer community on Discord
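
To make the OCR fallback in step 2 of "How It Works" concrete, the following sketch shows one way the "character density < 50/page" check could be expressed. It is an interpretation, not the Actor's actual code: the README does not say whether density is measured per page or averaged over the document, so this version averages stripped characters across pages, and it treats a PDF with no pages of text as scanned. The names `needs_ocr` and `extract_text` are hypothetical.

```python
from typing import Callable, Sequence

# Threshold quoted in the README: fewer than 50 characters per page means "scanned".
OCR_DENSITY_THRESHOLD = 50


def needs_ocr(page_texts: Sequence[str], threshold: int = OCR_DENSITY_THRESHOLD) -> bool:
    """Return True when the digitally extracted text looks too sparse to trust."""
    if not page_texts:
        # Assumption: no extractable pages at all is treated as a scanned document.
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < threshold


def extract_text(
    digital_pages: Sequence[str],
    ocr_fallback: Callable[[], list[str]],
) -> tuple[list[str], bool]:
    """Return (page texts, is_ocr_scanned), running OCR only when needed."""
    if needs_ocr(digital_pages):
        # The slow path runs lazily, so digital PDFs never pay the OCR cost.
        return ocr_fallback(), True
    return list(digital_pages), False
```

With the dependencies listed above, `digital_pages` would plausibly come from pdfplumber (`page.extract_text() or ""` for each page) and `ocr_fallback` from pdf2image's `convert_from_bytes` followed by `pytesseract.image_to_string` on each image. The boolean in the result corresponds to the `is_ocr_scanned` field in the output format.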

Actor Information

Developer: gagandeo
Pricing: Paid
Total Runs: 8
Active Users: 2
