Website Content Text Extractor

Name: Website Content Text Extractor
Author: smart-digital

201 runs

16 users

About Website Content Text Extractor

Extract visible text content from websites as structured JSON blocks. Supports multi-URL batch processing, header/footer/cookie exclusion, and optional form extraction. Perfect for content analysis and translation workflows.

What does this actor do?

Website Content Text Extractor is an Actor on the Apify platform. It renders each page in a browser and returns the text that is actually visible to users as structured JSON blocks, running in the cloud rather than on your own machine.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
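
The input-configuration step can also be prepared in code. The sketch below assembles an input object using field names taken from the input example in the Documentation section. The helper name `build_run_input` and the local clean-up of the URL list are illustrative assumptions, not part of the Actor itself.

```python
import json


def build_run_input(urls, viewport_type="desktop", **overrides):
    """Assemble an input object for the Actor (hypothetical helper)."""
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        # Drop empty lines and exact duplicates before submitting.
        if url and url not in seen:
            seen.add(url)
            start_urls.append(url)

    run_input = {
        "startUrls": start_urls,
        "viewportType": viewport_type,
        "excludeHeader": False,
        "excludeFooter": False,
        "excludeCookies": False,
        "extractForms": False,
    }
    # Any documented parameter can be overridden by keyword.
    run_input.update(overrides)
    return run_input


if __name__ == "__main__":
    example = build_run_input(
        ["https://example.com", "", "https://example.com/about", "https://example.com"],
        excludeHeader=True,
    )
    print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the Actor's input editor. Parameters that are not set here fall back to the defaults listed in the Documentation section.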

Documentation

# Website Content Text Extractor

Apify Actor for extracting visible text content from websites as structured JSON blocks.

## Description

Extract clean, visible text from websites as structured blocks. Perfect for content migration, translation workflows, and data analysis. This actor extracts text content that is actually visible to users, with options to exclude headers, footers, and cookie banners, and to extract form content.

## Features

- ✅ Multi-URL Batch Processing: Process multiple URLs in a single run
- ✅ Viewport Presets: Choose between Desktop (1920x1080), Mobile (375x667), Tablet (768x1024), or custom dimensions
- ✅ Header/Footer/Cookie Exclusion: Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
- ✅ Form Content Extraction: Optional extraction of form content (labels, placeholders, values, dropdown options)
- ✅ DOM Order Preservation: Text blocks extracted in the order they appear on the page
- ✅ Code Filtering: Automatically filters out JavaScript, CSS, and code snippets
- ✅ Deduplication: Removes duplicate text blocks
- ✅ Configurable Selectors: Customize which elements to include/exclude
- ✅ Clean JSON Output: Structured output perfect for content analysis, translation, and data migration

## Input

```json
{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportType": "desktop",
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true,
  "extractForms": false
}
```

### Parameters

- `startUrl` (optional): Single URL to extract text from (useful for quick tests). Leave empty if you only use `startUrls`
- `startUrls` (optional): List of additional URLs to process in bulk in a single run. Duplicates and empty lines are ignored. Use this field to process multiple pages in one execution
- `viewportType` (optional, default: "desktop"): Choose a predefined viewport size (desktop, mobile, tablet) or use custom dimensions
- `viewportWidth` (optional, default: 1920): Custom viewport width in pixels (only used when `viewportType` is "custom")
- `viewportHeight` (optional, default: 1080): Custom viewport height in pixels (only used when `viewportType` is "custom")
- `excludeHeader` (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
- `excludeFooter` (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
- `excludeCookies` (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
- `extractForms` (optional, default: false): Extract form content (labels, placeholders, values). When disabled, form elements (form, input, textarea, select, label) are excluded from extraction. When enabled, form content is extracted including labels, placeholders, and input values. Each form field and dropdown option is extracted as a separate block
- `excludeSelectors` (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if the Exclude Header/Footer/Cookies options don't work for your site structure
- `includeSelectors` (optional): CSS selectors to specifically include. If empty, all visible text is extracted
- `minTextLength` (optional, default: 3): Minimum character length for text blocks
- `deduplicate` (optional, default: true): Remove duplicate text blocks
- `waitForSelector` (optional): CSS selector to wait for before extraction
- `waitTimeout` (optional, default: 30000): Timeout in milliseconds for waiting for selector
- `removeEmptyBlocks` (optional, default: true): Remove empty or whitespace-only text blocks

## Output

The actor returns a JSON object with the following structure:

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}
```

### Output Fields

- `url`: The URL that was processed
- `title`: Page title
- `viewport`: Viewport dimensions used
- `textBlocks`: Array of extracted text blocks, each with:
  - `id`: Unique identifier (block-1, block-2, etc.)
  - `text`: The extracted text content
  - `order`: Order of appearance (1, 2, 3, etc.)
  - `tagName`: HTML tag name (h1, p, li, etc.)
  - `selector`: CSS selector if extracted from specific selector
- `statistics`: Summary statistics

## Use Cases

- Content Translation: Extract clean text blocks for translation workflows
- Content Analysis: Analyze visible content without navigation/header/footer noise
- SEO Content Extraction: Get only the main content for SEO analysis
- Content Migration: Extract content for migration to new platforms
- Form Data Extraction: Extract form labels, placeholders, and dropdown options for documentation or analysis

## Header and Footer Exclusion

By default, nothing is excluded: all visible text is extracted. You can optionally exclude headers and/or footers using the `excludeHeader` and `excludeFooter` options.

When enabled, these options use universal selectors compatible with:

- WordPress: `.site-header`, `.elementor-location-header`, `.wp-block-navigation`, etc.
- Shopify: `.shopify-section-header`, `.shopify-section-footer`, etc.
- Webflow: `[class='header']`, `[id='header']`, etc.
- Drupal: `.region-header`, `.region-footer`, etc.
- Joomla: `.header`, `.footer`, `.moduletable-menu`, etc.
- Generic: `header`, `footer`, `nav`, `[role='banner']`, `[role='contentinfo']`, etc.

If these options don't work for your specific site structure, you can use the `excludeSelectors` parameter to manually specify CSS selectors.

## Technical Details

- Uses Playwright for rendering with configurable viewport sizes
- Waits for `networkidle` to ensure all content is loaded
- Automatically scrolls pages to load lazy-loaded content
- Checks element visibility using `getBoundingClientRect()` and computed styles
- Filters out elements with `display: none`, `visibility: hidden`, `opacity: 0`
- Filters out JavaScript, CSS, and code snippets automatically
- Only includes elements within the viewport bounds
- Deduplicates text blocks by normalized (lowercase) content
- Maintains DOM order for accurate content structure
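
The filtering and de-duplication steps listed under Technical Details could look roughly like the sketch below. It is an illustration of the described behaviour, not the Actor's actual code: the function names are hypothetical, and collapsing whitespace before lowercasing is an assumption about what "normalized" means.

```python
import re


def normalize(text):
    """De-duplication key: collapse whitespace and lowercase (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_blocks(blocks, min_text_length=3, deduplicate=True, remove_empty=True):
    """Filter text blocks in DOM order, mirroring minTextLength,
    deduplicate and removeEmptyBlocks (hypothetical helper)."""
    seen = set()
    kept = []
    for block in blocks:
        text = block["text"].strip()
        if remove_empty and not text:
            continue  # empty or whitespace-only block
        if len(text) < min_text_length:
            continue  # shorter than the configured minimum
        if deduplicate:
            key = normalize(text)
            if key in seen:
                continue  # same content already kept earlier on the page
            seen.add(key)
        # Renumber so ids and order stay contiguous after filtering.
        order = len(kept) + 1
        kept.append({**block, "id": f"block-{order}", "text": text, "order": order})
    return kept


if __name__ == "__main__":
    sample = [
        {"text": "Welcome", "tagName": "h1"},
        {"text": "  welcome ", "tagName": "p"},
        {"text": "ok", "tagName": "span"},
        {"text": "Contact us today", "tagName": "p"},
    ]
    for block in filter_blocks(sample):
        print(block["order"], block["tagName"], block["text"])
```

Because the first occurrence of each text is kept and later repeats are skipped, the surviving blocks stay in the order they appear on the page.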

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Website Content Text Extractor now on Apify. Free tier available with no credit card required.


Actor Information

Developer: smart-digital
Pricing: Paid
Total Runs: 201
Active Users: 16

Related Actors

Web Scraper

by apify

Cheerio Scraper

by apify

Website Content Crawler

by apify

Legacy PhantomJS Crawler

by apify

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
