Website Content Text Extractor
by smart-digital
About Website Content Text Extractor
Extract visible text content from websites as structured JSON blocks. Supports multi-URL batch processing, header/footer/cookie exclusion, and optional form extraction. Perfect for content analysis and translation workflows.
What does this actor do?
Website Content Text Extractor is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
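As a rough illustration of the "configure the input parameters" step, the sketch below assembles an input object using the parameter names and default values given in the Documentation section. The helper `build_input` and the `DEFAULTS` and `VIEWPORT_PRESETS` tables are hypothetical names introduced here for illustration; they are not part of the actor.

```python
import json

# Viewport presets as listed in the actor's Documentation section.
VIEWPORT_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (375, 667),
    "tablet": (768, 1024),
}

# Documented defaults for the remaining input fields.
DEFAULTS = {
    "excludeHeader": False,
    "excludeFooter": False,
    "excludeCookies": False,
    "excludeSelectors": [],
    "includeSelectors": [],
    "minTextLength": 3,
    "deduplicate": True,
    "waitForSelector": "",
    "waitTimeout": 30000,
    "removeEmptyBlocks": True,
    "extractForms": False,
}


def build_input(urls, viewport_type="desktop", width=None, height=None, **options):
    """Assemble an input dict for the actor (hypothetical helper)."""
    # Drop empty entries and duplicates while keeping the original order.
    seen = set()
    start_urls = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            start_urls.append(url)

    if viewport_type == "custom":
        if width is None or height is None:
            raise ValueError("custom viewport needs both width and height")
    elif viewport_type in VIEWPORT_PRESETS:
        # The docs say width/height are only used for "custom"; the preset
        # values are filled in here so the dict is complete either way.
        width, height = VIEWPORT_PRESETS[viewport_type]
    else:
        raise ValueError(f"unknown viewportType: {viewport_type!r}")

    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    actor_input = {
        "startUrls": start_urls,
        "viewportType": viewport_type,
        "viewportWidth": width,
        "viewportHeight": height,
        **DEFAULTS,
        **options,
    }
    return actor_input


if __name__ == "__main__":
    example = build_input(
        ["https://example.com", "https://example.com/about"],
        viewport_type="mobile",
        excludeHeader=True,
        excludeFooter=True,
    )
    print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the actor's input editor. Rejecting unknown field names is a design choice of this sketch, meant to catch typos before a paid run starts.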
Documentation
# Website Content Text Extractor

Apify Actor for extracting visible text content from websites as structured JSON blocks.

## Description

Extract clean, visible text from websites as structured blocks. Perfect for content migration, translation workflows, and data analysis. This actor extracts text content that is actually visible to users, with options to exclude headers, footers, cookies, and extract form content.

## Features

- ✅ Multi-URL Batch Processing: Process multiple URLs in a single run
- ✅ Viewport Presets: Choose between Desktop (1920x1080), Mobile (375x667), Tablet (768x1024), or custom dimensions
- ✅ Header/Footer/Cookie Exclusion: Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
- ✅ Form Content Extraction: Optional extraction of form content (labels, placeholders, values, dropdown options)
- ✅ DOM Order Preservation: Text blocks extracted in the order they appear on the page
- ✅ Code Filtering: Automatically filters out JavaScript, CSS, and code snippets
- ✅ Deduplication: Removes duplicate text blocks
- ✅ Configurable Selectors: Customize which elements to include/exclude
- ✅ Clean JSON Output: Structured output perfect for content analysis, translation, and data migration

## Input

```json
{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportType": "desktop",
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true,
  "extractForms": false
}
```

### Parameters

- startUrl (optional): Single URL to extract text from (useful for quick tests). Leave empty if you only use startUrls
- startUrls (optional): List of additional URLs to process in bulk in a single run. Duplicates and empty lines are ignored.
  Use this field to process multiple pages in one execution
- viewportType (optional, default: "desktop"): Choose a predefined viewport size (desktop, mobile, tablet) or use custom dimensions
- viewportWidth (optional, default: 1920): Custom viewport width in pixels (only used when viewportType is "custom")
- viewportHeight (optional, default: 1080): Custom viewport height in pixels (only used when viewportType is "custom")
- excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
- excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
- excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
- extractForms (optional, default: false): Extract form content (labels, placeholders, values). When disabled, form elements (form, input, textarea, select, label) are excluded from extraction. When enabled, form content is extracted including labels, placeholders, and input values. Each form field and dropdown option is extracted as a separate block
- excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if Exclude Header/Footer/Cookies options don't work for your site structure
- includeSelectors (optional): CSS selectors to specifically include.
  If empty, all visible text is extracted
- minTextLength (optional, default: 3): Minimum character length for text blocks
- deduplicate (optional, default: true): Remove duplicate text blocks
- waitForSelector (optional): CSS selector to wait for before extraction
- waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
- removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks

## Output

The actor returns a JSON object with the following structure:

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": { "width": 1920, "height": 1080 },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}
```

### Output Fields

- url: The URL that was processed
- title: Page title
- viewport: Viewport dimensions used
- textBlocks: Array of extracted text blocks, each with:
  - id: Unique identifier (block-1, block-2, etc.)
  - text: The extracted text content
  - order: Order of appearance (1, 2, 3, etc.)
  - tagName: HTML tag name (h1, p, li, etc.)
  - selector: CSS selector if extracted from specific selector
- statistics: Summary statistics

## Use Cases

- Content Translation: Extract clean text blocks for translation workflows
- Content Analysis: Analyze visible content without navigation/header/footer noise
- SEO Content Extraction: Get only the main content for SEO analysis
- Content Migration: Extract content for migration to new platforms
- Form Data Extraction: Extract form labels, placeholders, and dropdown options for documentation or analysis

## Header and Footer Exclusion

By default, nothing is excluded - all visible text is extracted. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options. When enabled, these options use universal selectors compatible with:

- WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
- Shopify: .shopify-section-header, .shopify-section-footer, etc.
- Webflow: [class*='header'], [id*='header'], etc.
- Drupal: .region-header, .region-footer, etc.
- Joomla: .header, .footer, .moduletable-menu, etc.
- Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.

If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.

## Technical Details

- Uses Playwright for rendering with configurable viewport sizes
- Waits for networkidle to ensure all content is loaded
- Automatically scrolls pages to load lazy-loaded content
- Checks element visibility using getBoundingClientRect() and computed styles
- Filters out elements with display: none, visibility: hidden, opacity: 0
- Filters out JavaScript, CSS, and code snippets automatically
- Only includes elements within the viewport bounds
- Deduplicates text blocks by normalized (lowercase) content
- Maintains DOM order for accurate content structure
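
The minimum-length, empty-block, and deduplication rules described in the Technical Details can be approximated on the client side, for example to re-filter an existing result with stricter settings. The sketch below is only a guess at the behaviour, not the actor's actual code: the function name `clean_blocks` is hypothetical, and collapsing whitespace before lowercasing is an assumption, since the documentation only says blocks are deduplicated by "normalized (lowercase) content".

```python
def clean_blocks(blocks, min_text_length=3, deduplicate=True):
    """Filter and deduplicate textBlocks from one actor result (hypothetical helper)."""
    seen = set()
    kept = []
    # Process blocks in their documented order of appearance.
    for block in sorted(blocks, key=lambda b: b["order"]):
        # Assumption: normalization collapses runs of whitespace.
        text = " ".join(block["text"].split())
        # Drops empty/whitespace-only blocks and anything shorter than the minimum.
        if len(text) < min_text_length:
            continue
        key = text.lower()
        if deduplicate and key in seen:
            continue
        seen.add(key)
        kept.append({**block, "text": text})
    return kept


if __name__ == "__main__":
    sample = [
        {"id": "block-1", "text": "Welcome", "order": 1, "tagName": "h1", "selector": None},
        {"id": "block-2", "text": "  welcome ", "order": 2, "tagName": "p", "selector": None},
        {"id": "block-3", "text": "OK", "order": 3, "tagName": "span", "selector": None},
        {"id": "block-4", "text": "Contact   us today", "order": 4, "tagName": "p", "selector": None},
    ]
    for block in clean_blocks(sample):
        print(block["id"], repr(block["text"]))
```

Keeping the first occurrence of each duplicate preserves DOM order, which matters when the blocks are later sent to translation and written back in sequence.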
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try Website Content Text Extractor now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: smart-digital
- Pricing: Paid
- Total Runs: 201
- Active Users: 16
Related Actors
Web Scraper
by apify
Cheerio Scraper
by apify
Website Content Crawler
by apify
Legacy PhantomJS Crawler
by apify
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.