Universal Web Extractor V8

by motivational_nickel

299 runs

10 users

About Universal Web Extractor V8

Flexible web extractor using Python + Playwright or HTTP. Supports CSS-based field extraction, HTML snapshots, screenshots, metadata, monitoring mode, and link-following. Ideal for scraping product pages, listings, news articles, tech profiles, or universal structured data from any website.

What does this actor do?

Universal Web Extractor V8 is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
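The same configure-and-run steps can also be driven through the Apify API instead of the web console. The sketch below only builds the HTTP request for Apify's run-sync-get-dataset-items endpoint and does not send it. The Actor ID and token are placeholders (this listing does not state the Actor's exact ID), and build_run_request is a hypothetical helper name; the start_urls field is the input described in the documentation.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, start_urls):
    """Build (but do not send) a request that runs an Actor synchronously
    and returns its dataset items in the response body."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    body = json.dumps({"start_urls": start_urls}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholders: substitute the real "username~actor-name" ID and your own token.
request = build_run_request(
    "USERNAME~ACTOR-NAME",
    "YOUR_APIFY_TOKEN",
    ["https://example.com"],
)

# Sending the request would start a real (billable) run, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before spending any platform credits.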

Documentation

🟦 Universal Web Extractor V8, Python Edition (HTTPX + BeautifulSoup)

A fast, lightweight universal web scraper that fetches webpages over HTTP, parses HTML using BeautifulSoup, and returns clean, structured data (title, description, and full text) without launching a browser. This Actor is designed for speed, low cost, and simplicity, making it ideal for APIs, SEO pipelines, metadata extraction, and content analysis.

🚀 When to Use This Actor

Use Universal Web Extractor V8 (HTTP version) when:

Pages are static HTML (no JavaScript rendering required)
You need fast, low-cost scraping
You want clean text content from webpages
You are building SEO, research, or content pipelines

For JavaScript-heavy websites, use the Playwright edition of this Actor instead.

🧠 How It Works

The Actor loads start_urls from input.
For each URL, it:
    Sends an HTTP request using httpx
    Parses HTML with BeautifulSoup
    Extracts the title, description, and cleaned full text
It pushes the results to a flat JSON dataset.

No browser. No JavaScript rendering. Maximum speed.
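The parse-and-extract step described above can be sketched as follows. This is not the Actor's source code: the Actor uses httpx and BeautifulSoup, while this sketch swaps in Python's standard-library html.parser so it runs without extra dependencies, and it takes already-fetched HTML instead of making a request. The names _PageParser and extract_fields are hypothetical; the output keys follow the Actor's documented output fields.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects the <title>, the meta description, and visible text."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            # Text outside <title>, <script>, <style>, etc. counts as content.
            self.text_parts.append(data)


def extract_fields(url, html):
    """Return one flat record with title, description, and cleaned text."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        # split() + join() collapses runs of whitespace into single spaces.
        "title": " ".join("".join(parser.title_parts).split()),
        "description": parser.description,
        "text_content": " ".join(" ".join(parser.text_parts).split()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Calling extract_fields(url, html) once per fetched page yields one flat record per URL, which mirrors the "flat JSON dataset" shape the Actor describes.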
📥 Input Example

    {
      "start_urls": [
        "https://example.com",
        "https://quotes.toscrape.com/"
      ]
    }

📤 Output Example

    {
      "url": "https://example.com",
      "title": "Example Domain",
      "description": "This domain is for use in illustrative examples.",
      "text_content": "Example Domain This domain is for use in illustrative examples...",
      "timestamp": "2025-01-01T12:00:00Z"
    }

🧪 Best Practices

Use for static HTML pages
Ideal for:
    Articles
    Blogs
    Documentation
    Product descriptions
    SEO metadata scraping
Batch URLs for maximum efficiency

❗ Limitations

❌ Cannot render JavaScript
❌ Not suitable for SPAs (React, Vue, Angular)
❌ No auto-pagination (HTTP-only version)
❌ No selector-based structured extraction (yet)

💡 Tips

If a site requires JavaScript → use the Playwright version
Combine with downstream Actors for:
    Data cleaning
    NLP
    Embeddings
    Indexing

🔧 Changelog

v0.0.9: Python HTTP / BeautifulSoup Edition

Added httpx + BeautifulSoup extraction core
Automatic title, description, and text extraction
clean_html() helper for readable output
Simplified input schema (start_urls only)
Flat output schema (URL + timestamp + fields)
Ready for QA, Spotlight, and Challenge evaluation

🏆 Why This Actor Exists

This Actor focuses on speed, reliability, and simplicity, doing one thing extremely well: extracting clean content from webpages with minimal cost and maximum performance.
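As one illustration of the downstream processing mentioned in the Tips (for example, preparing text for embeddings or indexing), the sketch below splits a record's text_content into fixed-size word chunks. It assumes records shaped like the Output Example; chunk_record, max_words, and chunk_index are hypothetical names introduced here, not part of the Actor.

```python
def chunk_record(record, max_words=200):
    """Split a record's text_content into chunks of at most max_words words."""
    words = record.get("text_content", "").split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append({
            "url": record["url"],                # keep the source URL with each chunk
            "chunk_index": start // max_words,   # 0-based position within the page
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks


# A record in the shape of the Output Example, with shortened text.
sample = {
    "url": "https://example.com",
    "title": "Example Domain",
    "text_content": "Example Domain This domain is for use in illustrative examples",
}

for chunk in chunk_record(sample, max_words=4):
    print(chunk["chunk_index"], chunk["text"])
```

Keeping the URL on every chunk lets a later indexing step trace each piece of text back to the page it came from.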