PaddleOCR VL
by yeekal
About PaddleOCR VL
What does this actor do?
PaddleOCR VL is a document layout parsing Actor available on the Apify platform. Given the URL of an image or PDF, it returns the document's content as structured Markdown, one dataset item per page, and it runs entirely in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
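The "configure the input parameters" step can also be prepared in code before a run is started. The following is a minimal sketch, assuming the Actor accepts a JSON input with the `fileUrl` and `fileType` fields described in the Documentation section. The helper name `build_actor_input`, the exact option strings, and the local size check are illustrative assumptions, not part of the Actor's documented API.

```python
from typing import Optional
from urllib.parse import urlparse

# Documented limit is 5MB; interpreting it as 5 * 1024 * 1024 bytes is an assumption.
MAX_FILE_BYTES = 5 * 1024 * 1024
# Option labels as shown in the documentation; the real enum values may differ.
ALLOWED_FILE_TYPES = {"Autodetect", "Image", "PDF"}


def build_actor_input(file_url: str,
                      file_type: str = "Autodetect",
                      size_bytes: Optional[int] = None) -> dict:
    """Hypothetical helper: validate the arguments and assemble the Actor input."""
    parsed = urlparse(file_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("fileUrl must be a publicly accessible http(s) URL")
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"fileType must be one of {sorted(ALLOWED_FILE_TYPES)}")
    # Only checked when the caller already knows the file size.
    if size_bytes is not None and size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 5MB limit")
    return {"fileUrl": file_url, "fileType": file_type}


run_input = build_actor_input("https://example.com/report.pdf")
```

The resulting dictionary can be pasted into the Actor's Input tab as JSON or passed as the run input when starting the Actor through the Apify API.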
Documentation
# Paddle OCR Layout Parser

This Apify Actor provides an interface to the Paddle OCR Layout Parsing API. You submit an image or a PDF file via a URL and receive structured Markdown content, with all embedded images linked by their absolute URLs. The Actor also returns a visual representation of the parsed layout.

## Features

- Supports Images and PDFs: Processes various image formats (PNG, JPG, etc.) and multi-page PDF documents.
- Smart File Type Detection: Automatically determines the file type from the URL, or you can specify it manually.
- Markdown Content Extraction: Extracts the full textual content and structure of the document into clean Markdown.
- Layout Visualization: Provides a URL to an image that visually highlights the detected layout structure (titles, paragraphs, figures, tables).
- File Size Limit: Protects against oversized files by enforcing a 5MB limit.

## Input

The Actor requires the following inputs, which are defined in the Input tab.

| Field | Type | Description |
| --- | --- | --- |
| File URL (`fileUrl`) | String | Required. A publicly accessible URL to the image or PDF file you want to process. The file size must not exceed 5MB. |
| File Type (`fileType`) | String | The type of the file. It's recommended to leave this as Autodetect. Options: Autodetect, Image, PDF. |

## Output

The Actor stores its results in the Apify default dataset. Each item in the dataset corresponds to a page from the input file.

### Output Structure (JSON)

```json
[
  {
    "pageNumber": 1,
    "processedMarkdown": "## This is the Title\n\nAnd this is a paragraph of text. Here is an image:\n\n<div style=\"text-align: center;\"><img src=\"https://example.com/path/to/image.jpg\" alt=\"Image\" width=\"50%\" /></div>",
    "layoutImageUrl": "https://example.com/path/to/layout_visualization.jpg"
  }
]
```

- `pageNumber`: The page of the input file that this item corresponds to.
- `processedMarkdown`: The primary output. Ready-to-render Markdown with absolute image URLs.
- `layoutImageUrl`: A URL to an image visualizing the detected document layout.
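Because the dataset holds one item per page, a multi-page PDF usually needs its pages stitched back together. The sketch below shows one way to do that, assuming the items have already been downloaded as a list of dictionaries shaped like the output structure above. The function name `merge_pages` and the sample data are hypothetical.

```python
from typing import Iterable


def merge_pages(items: Iterable[dict]) -> str:
    """Join per-page Markdown into one document, ordered by pageNumber."""
    ordered = sorted(items, key=lambda item: item["pageNumber"])
    # Separate pages with a blank line so Markdown blocks do not run together.
    return "\n\n".join(item["processedMarkdown"] for item in ordered)


# Hypothetical dataset items, deliberately out of order.
sample_items = [
    {"pageNumber": 2, "processedMarkdown": "Second page text.",
     "layoutImageUrl": "https://example.com/layout_2.jpg"},
    {"pageNumber": 1, "processedMarkdown": "## Title\n\nFirst page text.",
     "layoutImageUrl": "https://example.com/layout_1.jpg"},
]

document = merge_pages(sample_items)
```

Sorting by `pageNumber` rather than relying on dataset order is a defensive choice; the source does not state the order in which items are stored.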
Actor Information
- Developer: yeekal
- Pricing: Paid
- Total Runs: 67
- Active Users: 5