landingai-ade-extractor

by clever_fashion

7 runs
1 user

About landingai-ade-extractor

Official LandingAI Agentic Document Extraction (ADE) wrapper for Apify. Turn any PDF or image (invoices, receipts, IDs, forms, contracts, passports) into perfect structured JSON in seconds – no prompt engineering needed.

What does this actor do?

landingai-ade-extractor is an Apify Actor that wraps LandingAI's Agentic Document Extraction (ADE) library. It extracts structured data from PDFs and images and runs in the cloud on the Apify platform.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Optional routing of LandingAI API calls through Apify proxies
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

# LandingAI ADE Document Extractor Actor

An Apify Actor that wraps LandingAI's Agentic Document Extraction (ADE) library to extract structured data from visual documents (PDFs and images) via API.

## Features

- 🔍 Intelligent Document Extraction: Extracts structured data from PDFs and images using AI
- 📄 Multiple Formats: Supports both PDF documents and image files
- 🎯 Custom Instructions: Provide specific instructions for what data to extract
- 🖼️ Visual Groundings: Optional saving of grounding images showing where data was extracted
- ⚡ Async Processing: Built with async/await for optimal performance
- ✅ Comprehensive Tests: Full test coverage with pytest following TDD practices

## Input

The Actor accepts the following input parameters:

```json
{
  "apiKey": "land_sk_YOUR_API_KEY_HERE",
  "documentUrl": "https://example.com/document.pdf",
  "documentPath": "/path/to/local/document.pdf",
  "instructions": "Extract invoice number, date, and total amount",
  "saveGroundings": true
}
```

### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `apiKey` | String | Yes | Your LandingAI API key for authentication |
| `documentUrl` | String | Conditional | URL of the document to process |
| `documentPath` | String | Conditional | Local path to the document file |
| `instructions` | String | No | Custom instructions for extraction (default: "Extract all key information from this document.") |
| `saveGroundings` | Boolean | No | Whether to save grounding images to key-value store (default: false) |
| `useProxies` | Boolean | No | Enable routing API calls through Apify proxies (default: false) |
| `proxyConfiguration` | Object | No | Apify proxy configuration (required if `useProxies` is true) |

Either `documentUrl` or `documentPath` must be provided. If both are provided, `documentUrl` takes priority.
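The input rules above (a required `apiKey`, one of `documentUrl` or `documentPath` with the URL taking priority, and the documented defaults) can be sketched as a small validation function. This is a minimal illustration only: `resolve_input` and the keys of the dictionary it returns are hypothetical names, and the Actor's actual validation code is not shown in this README.

```python
DEFAULT_INSTRUCTIONS = "Extract all key information from this document."


def resolve_input(actor_input: dict) -> dict:
    """Validate raw Actor input and apply the documented defaults.

    Hypothetical helper: it mirrors the rules in the Input Parameters
    table, not the Actor's real implementation.
    """
    api_key = actor_input.get("apiKey")
    if not api_key:
        raise ValueError("apiKey is required")

    # documentUrl takes priority when both sources are supplied.
    source = actor_input.get("documentUrl") or actor_input.get("documentPath")
    if not source:
        raise ValueError("Either documentUrl or documentPath must be provided")

    # proxyConfiguration is only required when useProxies is true.
    use_proxies = bool(actor_input.get("useProxies", False))
    proxy_configuration = actor_input.get("proxyConfiguration")
    if use_proxies and not proxy_configuration:
        raise ValueError("proxyConfiguration is required when useProxies is true")

    return {
        "api_key": api_key,
        "document_source": source,
        "instructions": actor_input.get("instructions") or DEFAULT_INSTRUCTIONS,
        "save_groundings": bool(actor_input.get("saveGroundings", False)),
        "use_proxies": use_proxies,
        "proxy_configuration": proxy_configuration if use_proxies else None,
    }
```

Failing early with a `ValueError` matches the "Missing Required Parameters" behaviour the README describes, where inputs are validated before any document is processed.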
## Output

The Actor pushes results to the dataset with the following structure:

```json
{
  "structured_data": {
    "invoice_number": "INV-2024-001",
    "date": "2024-12-05",
    "total": 1500.00
  },
  "markdown": "# Invoice INV-2024-001\n\nDate: 2024-12-05\nTotal: $1,500.00",
  "document_source": "https://example.com/invoice.pdf",
  "extraction_time_seconds": 2.34,
  "total_time_seconds": 2.45,
  "instructions": "Extract invoice number, date, and total amount",
  "groundings_saved": true
}
```

### Output Fields

| Field | Type | Description |
|-------|------|-------------|
| `structured_data` | Object | Extracted structured information as key-value pairs |
| `markdown` | String | Document content formatted as markdown |
| `document_source` | String | Source URL or path of the processed document |
| `extraction_time_seconds` | Number | Time taken for the extraction process |
| `total_time_seconds` | Number | Total execution time including I/O |
| `instructions` | String | The instructions used for extraction |
| `groundings_saved` | Boolean | Whether grounding images were saved |

## Grounding Images

When `saveGroundings` is set to `true`, the Actor saves visual annotations to the key-value store showing where data was extracted from the document. These images are saved with keys like:

- `grounding-0`
- `grounding-1`
- `grounding-2`

You can access these images from the Actor's key-value store after the run completes.

## Proxy Support

The Actor supports routing LandingAI API calls through Apify's proxy servers.
This is useful for:

- Rate Limit Management: Distribute requests across multiple IPs
- Geographic Restrictions: Access region-specific content
- IP Rotation: Avoid blocks from excessive requests

### Using Proxies

Enable proxy support by setting `useProxies` to `true`:

```json
{
  "apiKey": "your-api-key",
  "documentUrl": "https://example.com/document.pdf",
  "useProxies": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "US"
  }
}
```

### Proxy Configuration Options

| Option | Description |
|--------|-------------|
| `useApifyProxy` | Enable Apify's proxy service (default: true) |
| `apifyProxyGroups` | Proxy groups: `["RESIDENTIAL"]`, `["DATACENTER"]`, `["GOOGLE_SERP"]` |
| `apifyProxyCountry` | Two-letter country code (e.g., "US", "GB", "DE") |

Note: Proxy usage requires an Apify subscription with proxy access enabled.

## Local Development

### Prerequisites

- Python 3.9+
- Apify CLI: `npm install -g apify-cli`
- LandingAI API key

### Installation

1. Clone this repository or navigate to the Actor directory:

   ```bash
   cd ade-extractor
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your API key in `storage/key_value_stores/default/INPUT.json`:

   ```json
   {
     "apiKey": "land_sk_YOUR_API_KEY_HERE",
     "documentUrl": "https://example.com/sample.pdf",
     "instructions": "Extract key information",
     "saveGroundings": false
   }
   ```

### Running Locally

```bash
apify run
```

### Running Tests

The Actor includes comprehensive unit tests following TDD (Test-Driven Development) practices:

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Run specific test file
python -m pytest tests/test_extraction.py -v
```

Test coverage includes:

- ✅ Successful document extraction from URLs
- ✅ Successful document extraction from local files
- ✅ Invalid API key handling
- ✅ Missing document source validation
- ✅ Saving grounding images
- ✅ Empty groundings handling
- ✅ Input validation scenarios
- ✅ Complete extraction workflows
- ✅ Grounding image workflows

## Deployment

### Deploy to Apify Platform

1. Authenticate with Apify:

   ```bash
   apify login
   ```

2. Deploy the Actor:

   ```bash
   apify push
   ```

## Error Handling

The Actor includes comprehensive error handling for:

- Invalid API Key: Returns an error with a message about checking the API key
- Missing Required Parameters: Validates all required inputs before processing
- API Connection Errors: Catches and reports connection issues with the LandingAI API
- Document Access Errors: Handles cases where the document URL or path is inaccessible
- Unexpected Errors: Catches and logs all unexpected errors with full details

## API Rate Limits

Please be aware of LandingAI API rate limits and quotas. The Actor processes documents asynchronously but respects API limitations.

## Best Practices

1. Use Specific Instructions: The more specific your extraction instructions, the better the results
2. Enable Groundings for Debugging: Turn on `saveGroundings` when testing to verify extraction accuracy
3. Handle Large Documents: For large PDFs, be aware of processing time and API timeouts
4. Secure API Keys: Always store API keys as secrets, never hardcode them

## Architecture

The Actor follows Apify best practices:

- Async/Await: All I/O operations use async for optimal performance
- Input Validation: Early validation of all input parameters
- Error Handling: Comprehensive error handling with descriptive messages
- Logging: Detailed logging using Apify's logging system
- Storage: Proper use of Dataset and Key-Value Store
- Docker Compatible: No hardcoded local paths, fully containerized

## Technology Stack

- Python 3.9+: Modern async Python
- Apify SDK: Official Apify Python SDK
- LandingAI ADE: Agentic Document Extraction library
- Pytest: Test framework with async support
- asyncio: Asynchronous I/O support

## Development Approach

This Actor was developed following Test-Driven Development (TDD) principles:

1. ✅ Wrote comprehensive documentation and comments
2. ✅ Created failing unit tests (red phase)
3. ✅ Implemented minimal code to pass tests (green phase)
4. ✅ Refactored while maintaining test coverage

All 11 tests pass with comprehensive coverage of core functionality.

## License

This Actor is provided as-is for use with the Apify platform and LandingAI services.

## Support

For issues related to:

- Actor functionality: Create an issue in this repository
- LandingAI API: Contact LandingAI support
- Apify platform: Visit Apify documentation

## Related Resources

- LandingAI ADE Documentation
- Apify Documentation
- Apify SDK Python
- LandingAI ADE PyPI Package

## Version History

### 1.0.0 (2024-12-05)

- Initial release
- Support for PDF and image document extraction
- Grounding image storage
- Comprehensive test suite
- Full async/await support
- Docker-compatible implementation
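As a companion to the Input, Output, and Grounding Images sections of the README, the sketch below shows one way to start a run from Python and collect its results. It assumes the `apify-client` package is installed, and the Actor ID `clever_fashion/landingai-ade-extractor` is an assumption built from the developer and Actor names on this page. The helper names `build_run_input` and `run_extraction` are hypothetical, as is the `max_groundings` cap.

```python
# Assumed Actor ID, built from the developer and Actor names on this page.
ACTOR_ID = "clever_fashion/landingai-ade-extractor"


def build_run_input(api_key, document_url, instructions=None, save_groundings=False):
    """Assemble the input object described in the README's Input section."""
    run_input = {
        "apiKey": api_key,
        "documentUrl": document_url,
        "saveGroundings": save_groundings,
    }
    # Omit instructions so the Actor falls back to its documented default.
    if instructions:
        run_input["instructions"] = instructions
    return run_input


def run_extraction(apify_token, run_input, max_groundings=3):
    """Run the Actor and return (dataset items, grounding records by key)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")

    # Each dataset item follows the structure shown in the Output section.
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

    # Grounding images are stored under keys grounding-0, grounding-1, ...
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    groundings = {}
    for index in range(max_groundings):
        key = f"grounding-{index}"
        record = store.get_record(key)
        if record is None:
            break
        groundings[key] = record["value"]
    return items, groundings
```

In line with the "Secure API Keys" advice, both the LandingAI key and the Apify token would normally be read from environment variables or a secret store and passed in, rather than written into the script.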

Actor Information

Developer: clever_fashion
Pricing: Paid
Total Runs: 7
Active Users: 1

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
