Pdf OCR API

Author: cspnair


Extract and convert text from PDF documents using advanced optical character recognition technology with support for multiple AI models.

372 runs

21 users



What does this actor do?

Pdf OCR API is an actor on the Apify platform that runs optical character recognition on PDF documents in the cloud and returns the extracted text. You supply PDF URLs and choose an OCR model; the actor returns the text as JSON, plain text, or Markdown.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
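
As a rough illustration of the "configure the input parameters" step, the sketch below assembles a run input and checks it before submission. The parameter names and defaults come from the Documentation section; `build_input` and `REQUIRED_KEYS` are hypothetical helpers, not part of the actor, and only the two `ocrModel` values that appear in the documented examples are covered. The actor's input schema remains the authority.

```python
import json

# API-key fields each engine needs, taken from the "API Keys (model-specific)"
# list. Only "native" and "google-vision" appear as ocrModel values in the
# documented examples, so this sketch does not guess identifiers for the rest.
REQUIRED_KEYS = {
    "native": [],
    "google-vision": ["googleVisionApiKey"],
}


def build_input(ocr_model, pdf_urls, **options):
    """Return a run-input dict, raising ValueError if something is missing."""
    if ocr_model not in REQUIRED_KEYS:
        raise ValueError(f"ocrModel not covered by this sketch: {ocr_model!r}")
    if not pdf_urls:
        raise ValueError("pdfUrls must contain at least one URL")
    missing = [k for k in REQUIRED_KEYS[ocr_model] if not options.get(k)]
    if missing:
        raise ValueError(f"{ocr_model} requires: {', '.join(missing)}")
    run_input = {
        "ocrModel": ocr_model,
        "pdfUrls": list(pdf_urls),
        # Defaults as listed in the documentation.
        "language": "eng",
        "outputFormat": "json",
        "pageRange": "all",
    }
    # Caller-supplied options override the defaults above.
    run_input.update(options)
    return run_input


payload = build_input(
    "native",
    ["https://example.com/document.pdf"],
    pageRange="1-5",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be pasted into the actor's input editor. Validating locally catches a missing API key before a run is started.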

Documentation

# PDF OCR API - Multi-Model Text Extraction

Extract and convert text from PDF documents using advanced optical character recognition technology with support for multiple AI models.

## 🌟 Features

### Multi-Model OCR Support

Choose from 8 different OCR engines based on your needs:

- Google Vision API - High accuracy commercial OCR with excellent language support
- DeepSeek OCR - Advanced AI-powered text extraction
- Amazon Textract - AWS-powered document analysis optimized for PDFs
- Azure AI Vision - Microsoft's computer vision OCR service
- OpenAI GPT-4 Vision - State-of-the-art multimodal AI for complex documents
- Hugging Face - Open-source transformer models for text extraction
- Google Gemini - Latest Google multimodal AI technology
- Native (Tesseract.js) - Free, no API key required, runs entirely in-container

### Document Processing Features

- ✅ Batch Processing - Process multiple PDFs simultaneously
- ✅ Multi-Language Support - English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Korean, Arabic, Danish
- ✅ Structure Preservation - Maintain document layout and formatting
- ✅ Page Range Selection - Process specific pages or page ranges
- ✅ Multiple Output Formats - JSON, Plain Text, or Markdown
- ✅ High Resolution - 300 DPI conversion for optimal OCR accuracy
- ✅ Metadata Extraction - Extract PDF metadata (title, author, dates)
- ✅ Pay-Per-Page Pricing - Fair billing based on actual pages processed (see BILLING.md)

## 📋 Input Parameters

### Required

- `ocrModel` - OCR model to use (default: "native")
- `pdfUrls` - Array of PDF document URLs to process

### Optional

- `language` - Document language (default: "eng")
- `preserveFormatting` - Maintain document structure (default: true)
- `extractImages` - Extract images from PDF (default: false)
- `outputFormat` - Output format: "json", "text", or "markdown" (default: "json")
- `pageRange` - Pages to process: "all", "1-5", "1,3,5" (default: "all")

### API Keys (model-specific)

- `googleVisionApiKey` - For Google Vision API
- `deepseekApiKey` - For DeepSeek OCR
- `awsAccessKeyId`, `awsSecretAccessKey`, `awsRegion` - For Amazon Textract
- `azureEndpoint`, `azureApiKey` - For Azure AI Vision
- `openaiApiKey` - For OpenAI GPT-4 Vision
- `huggingfaceApiKey` - For Hugging Face models
- `geminiApiKey` - For Google Gemini

## 🚀 Quick Start

### Example Input (Native OCR - No API Key Required)

```json
{
  "ocrModel": "native",
  "pdfUrls": [
    "https://example.com/document.pdf"
  ],
  "language": "eng",
  "outputFormat": "json",
  "pageRange": "all"
}
```

### Example with Google Vision API

```json
{
  "ocrModel": "google-vision",
  "googleVisionApiKey": "YOUR_API_KEY",
  "pdfUrls": [
    "https://example.com/document.pdf",
    "https://example.com/another-document.pdf"
  ],
  "language": "eng",
  "preserveFormatting": true,
  "outputFormat": "markdown"
}
```

### Process Specific Pages

```json
{
  "ocrModel": "native",
  "pdfUrls": ["https://example.com/large-document.pdf"],
  "pageRange": "1-5,10,15-20",
  "outputFormat": "text"
}
```

## 📤 Output Format

### JSON Output (default)

```json
{
  "pdfUrl": "https://example.com/document.pdf",
  "fileName": "document.pdf",
  "ocrModel": "native",
  "language": "eng",
  "success": true,
  "extractedAt": "2024-11-04T10:30:00.000Z",
  "pageCount": 5,
  "totalCharacters": 12450,
  "averageConfidence": 0.94,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page 1 content...",
      "confidence": 0.95,
      "width": 2480,
      "height": 3508
    }
  ],
  "fullText": "Complete document text..."
}
```

### Text Output

```json
{
  "output": "Complete document text as plain string...",
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page 1 content..."
    }
  ]
}
```

### Markdown Output

```json
{
  "output": "# document.pdf\n\nPages: 5\n\n## Page 1\n\nContent...",
  "pages": [
    {
      "pageNumber": 1,
      "markdown": "## Page 1\n\nContent..."
    }
  ]
}
```

## 💡 Use Cases

### Business & Legal

- Contract analysis and digitization
- Legal document processing
- Invoice and receipt extraction
- Compliance document archiving

### Academic & Research

- Research paper text extraction
- Academic document digitization
- Literature review automation
- Citation extraction

### Content & Publishing

- Book digitization
- Magazine and newspaper archiving
- Historical document preservation
- Content migration projects

### Development & Integration

- Document management systems
- Search and indexing pipelines
- Data extraction workflows
- Archive digitization projects

## 🔧 Supported Languages

- English (eng)
- Spanish (spa)
- French (fra)
- German (deu)
- Italian (ita)
- Portuguese (por)
- Russian (rus)
- Chinese Simplified (chi_sim)
- Japanese (jpn)
- Korean (kor)
- Arabic (ara)

## 📊 Model Comparison

| Model | Speed | Accuracy | Cost | Best For |
|-------|-------|----------|------|----------|
| Native (Tesseract) | ⚡⚡⚡ | 85% | Free | Testing, simple docs |
| Google Vision | ⚡⚡ | 95% | $$ | Production, multi-language |
| Amazon Textract | ⚡⚡ | 96% | $$ | Forms, tables, structured docs |
| Azure Vision | ⚡⚡ | 94% | $$ | Enterprise integration |
| OpenAI GPT-4 | ⚡ | 94% | $$$ | Complex layouts, handwriting |
| Gemini | ⚡⚡ | 93% | $$ | Modern documents |

## 🎯 Best Practices

### For Optimal Results

1. Use high-quality PDF sources (not scanned at low resolution)
2. Select the appropriate language setting
3. Use premium models for complex layouts or handwriting
4. Process pages in batches for large documents
5. Enable formatting preservation for structured documents

### Performance Tips

1. Use page ranges to process only needed pages
2. Batch multiple PDFs in a single run
3. Choose Native OCR for simple, clear documents
4. Use premium models only when necessary

### Cost Optimization

1. Start with Native OCR for testing
2. Use page ranges to avoid processing unnecessary pages
3. Batch process to reduce overhead
4. Monitor API costs for premium models

## 📈 Performance

- Processing Speed: 5-30 seconds per page (varies by model)
- Concurrent Processing: Up to 10 PDFs simultaneously
- Maximum File Size: 100MB per PDF
- Supported Formats: PDF (any version)
- Resolution: 300 DPI conversion

## 💰 Pricing

This actor uses pay-per-event pricing:

- $0.01 per PDF processed successfully (configurable)
- Failed PDFs are not charged
- Events tracked: `pdf_processed`

## 🆘 Support

For issues, questions, or feature requests:

- Check the Apify documentation
- Review the input schema for parameter details
- Ensure API keys are valid and have sufficient quota
- Verify PDF files are accessible and not corrupted

## 🔄 Version History

### v1.0

- Initial release
- Support for 8 OCR models
- Multi-language support (12 languages)
- Batch processing capabilities
- Multiple output formats (JSON, Text, Markdown)
- Page range selection
- Structure preservation
- Pay-per-event pricing

## 📚 Related Actors

- Receipt OCR API - Specialized for receipt processing
- Invoice OCR API - Optimized for invoice extraction
- Form OCR API - Structured form data extraction

## 🔗 Links

- Actor on Apify Store
- Documentation
- Support

---

Transform your PDF documents into searchable, structured data! 📄✨
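
One way to act on the advice to use premium models only when necessary is to run Native OCR first, then re-run only the pages with low confidence. The sketch below assumes a result record shaped like the documented JSON output. The sample values, the `0.8` threshold, and the helper names `low_confidence_pages` and `to_page_range` are illustrative assumptions, not part of the actor.

```python
# Sample record following the documented JSON output shape; values are made up.
record = {
    "pdfUrl": "https://example.com/document.pdf",
    "ocrModel": "native",
    "success": True,
    "pageCount": 3,
    "pages": [
        {"pageNumber": 1, "text": "Page 1 content...", "confidence": 0.95},
        {"pageNumber": 2, "text": "Pa9e 2 c0ntent...", "confidence": 0.62},
        {"pageNumber": 3, "text": "Page 3 content...", "confidence": 0.91},
    ],
}


def low_confidence_pages(record, threshold=0.8):
    """Return page numbers whose confidence is below the threshold.

    The threshold is an arbitrary assumption; tune it for your documents.
    Records with success == false yield an empty list.
    """
    if not record.get("success"):
        return []
    return [
        page["pageNumber"]
        for page in record.get("pages", [])
        if page.get("confidence", 0.0) < threshold
    ]


def to_page_range(page_numbers):
    """Join page numbers into the comma-separated pageRange form, e.g. "1,3,5"."""
    return ",".join(str(n) for n in sorted(page_numbers))


flagged = low_confidence_pages(record)
print(to_page_range(flagged))  # prints "2"
```

The resulting string can be passed as `pageRange` in a second run that uses a premium `ocrModel`, so only the pages that need it are reprocessed.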



Actor Information

Developer: cspnair
Pricing: Paid
Total Runs: 372
Active Users: 21
