AI Content Scraper & Cleaner
by dashjeevanthedev
About AI Content Scraper & Cleaner
AI Content Scraper & Cleaner — Scrapes structured content (documentation, articles, FAQs, blog posts) and converts it into clean, normalized JSON datasets for LLM training. Extracts text, detects content types, estimates tokens, and removes boilerplate to produce ready-to-use training data.
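Concretely, each scraped page becomes one JSON record. A hypothetical example item, using the field names from the documentation below (the values are invented for illustration):

```json
{
  "url": "https://example.com/docs/getting-started",
  "title": "Getting Started",
  "content": "This guide walks you through installing and configuring...",
  "contentType": "guide",
  "tokensEstimate": 412,
  "language": "en",
  "extractedAt": "2025-01-15T09:30:00.000Z"
}
```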
What does this actor do?
AI Content Scraper & Cleaner is a web scraping and automation tool available on the Apify platform, designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
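For the API-access point above, runs can be started through Apify's REST API (`POST https://api.apify.com/v2/acts/{actorId}/runs?token={token}`). A minimal sketch that only assembles the request; the actor ID slug and token shown are placeholders, and error handling is omitted:

```javascript
// Minimal sketch, assuming the documented Apify v2 run endpoint.
// This helper only builds the request; pass its parts to fetch() to send it.
function buildActorRunRequest(actorId, token, input) {
  return {
    url: `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs?token=${token}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(input), // the actor's input object
    },
  };
}

// Placeholder actor slug and token, for illustration only.
const req = buildActorRunRequest(
  "dashjeevanthedev~ai-content-scraper-cleaner",
  "MY_TOKEN",
  { startUrls: "https://example.com/docs" }
);
console.log(req.url);
```

Sending `req.options` to `req.url` with `fetch` would start a run; results are then read from the run's default dataset.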
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# AI Content Scraper & Cleaner

An Apify Actor that scrapes structured content (documentation, articles, FAQs, blog posts) and automatically converts it into clean, normalized JSON datasets suitable for LLM training and fine-tuning.

## 🚀 Features

- Intelligent Content Extraction: Automatically extracts main content using configurable CSS selectors
- Content Type Detection: Automatically detects content types (FAQ, article, guide, documentation, blog)
- Text Cleaning: Removes HTML tags, scripts, and styles, and normalizes whitespace
- Token Estimation: Estimates token counts for LLM training (useful for dataset planning)
- Language Detection: Optional language filtering support
- Respectful Crawling: Honors robots.txt and implements rate limiting
- Proxy Support: Built-in Apify Proxy integration for reliable scraping
- Structured Output: Clean JSON dataset items with metadata

## 📋 Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrls` | string | - | Comma-separated list of URLs to start crawling from (required) |
| `maxRequestsPerCrawl` | string | `"50"` | Maximum number of requests allowed for this run |
| `contentSelectors` | string | `"article, .doc-content, .post-content"` | Comma-separated CSS selectors for main content extraction |
| `titleSelectors` | string | `"h1, .post-title"` | Comma-separated CSS selectors for title extraction |
| `minimumTextLength` | string | `"300"` | Ignore content shorter than this many characters |
| `contentType` | string | `"auto"` | Content type override (`auto`, `faq`, `article`, `guide`, `documentation`, `blog`, `other`) |
| `maxDepth` | string | `"2"` | Maximum link-following depth from the start URL |
| `respectRobotsTxt` | string | `"true"` | Whether to honor robots.txt rules |
| `useProxy` | string | `"true"` | Rotate proxies via Apify Proxy when available |
| `language` | string | `""` | Optional language code filter (e.g., `"en"`) |

## 📤 Output

The Actor outputs structured JSON dataset items with the following fields:

- `url`: Source URL of the scraped content
- `title`: Extracted page title
- `content`: Cleaned text content (HTML removed, normalized)
- `contentType`: Detected or specified content type
- `tokensEstimate`: Estimated token count for LLM training
- `language`: Detected or specified language code
- `extractedAt`: ISO timestamp of extraction

## 🛠️ Installation & Usage

### Prerequisites

- Node.js 20+ installed
- Apify CLI installed (Installation Guide)

### Local Development

1. Clone or navigate to the Actor directory:

   ```bash
   cd AI-Ready-Dataset
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Configure input: edit `input.json` with your target URLs:

   ```json
   {
     "startUrls": "https://example.com/docs, https://example.com/blog",
     "maxRequestsPerCrawl": "100",
     "minimumTextLength": "300"
   }
   ```

4. Run locally:

   ```bash
   apify run
   ```

5. View results: check `storage/datasets/default/` for the scraped data

### Deploy to Apify Cloud

1. Authenticate:

   ```bash
   apify login
   ```

2. Deploy:

   ```bash
   apify push
   ```

3. Run on Apify: use the Apify Console UI, or the CLI (`apify call <actor-id>`)

## 📝 Example Input

```json
{
  "startUrls": "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors",
  "maxRequestsPerCrawl": "50",
  "contentSelectors": "article, .doc-content, .post-content",
  "titleSelectors": "h1, .post-title",
  "minimumTextLength": "300",
  "contentType": "auto",
  "maxDepth": "2",
  "respectRobotsTxt": "true",
  "useProxy": "true",
  "language": "en"
}
```

## 🎯 Use Cases

- LLM Training Data Collection: Scrape documentation and articles for fine-tuning language models
- Knowledge Base Building: Extract structured content from documentation sites
- Content Analysis: Collect and analyze text content from multiple sources
- Dataset Creation: Build custom datasets for machine learning projects
- Content Migration: Extract content from websites for migration or archival

## 🔧 How It Works

1. URL Discovery: Starts from the provided URLs and follows links up to the specified depth
2. Content Extraction: Uses CSS selectors to extract main content and titles
3. Text Cleaning: Removes HTML, scripts, and styles, and normalizes whitespace
4. Content Classification: Automatically detects the content type using heuristics
5. Token Estimation: Calculates approximate token counts for LLM training
6. Data Storage: Saves cleaned, structured data to an Apify Dataset

## 📊 Content Type Detection

The Actor automatically detects content types using heuristics:

- FAQ: Contains "faq" or "frequently asked" keywords
- Guide: Contains "how to", "step", or "guide" keywords
- Documentation: Contains "documentation" or "api reference" keywords
- Article: Long-form content (>1000 words) or the default fallback

## ⚙️ Configuration Tips

### For Documentation Sites

```json
{
  "contentSelectors": "article, .doc-content, .documentation-content, main",
  "titleSelectors": "h1, .doc-title, .page-title"
}
```

### For Blog Sites

```json
{
  "contentSelectors": "article, .post-content, .entry-content, .blog-post",
  "titleSelectors": "h1, .post-title, .entry-title"
}
```

### For FAQ Pages

```json
{
  "contentSelectors": ".faq, .faq-item, .question-answer, article",
  "minimumTextLength": "100"
}
```

## 🚨 Important Notes

- Respect robots.txt: The Actor respects robots.txt by default; disable this only if you have permission
- Rate Limiting: Built-in delays prevent overloading target servers
- Content Filtering: Use `minimumTextLength` to filter out navigation and boilerplate
- Proxy Usage: Apify Proxy helps avoid IP blocking and rate limits

## 📚 Resources

- Apify Documentation
- Crawlee Documentation
- Apify Actor Development Guide

## 🤝 Contributing

This Actor follows Apify Actor best practices:

- Uses CheerioCrawler for fast static HTML scraping
- Implements proper error handling and retry logic
- Respects website terms and robots.txt
- Provides clean, structured output

## 📄 License

ISC

## 🔗 Links

- Actor on Apify: View on Apify Platform
- Apify CLI: Installation Guide

## 💡 Tips for Best Results

1. Start Small: Test with a few URLs first to verify that your selectors work
2. Adjust Selectors: Different sites need different CSS selectors, so customize as needed
3. Set Depth Carefully: Higher depth means more pages but a longer runtime
4. Filter by Length: Use `minimumTextLength` to avoid capturing navigation and headers
5. Monitor Progress: Check the Apify Console for real-time crawling progress

---

Built with ❤️ using the Apify SDK and Crawlee
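The cleaning, classification, and token-estimation steps described above can be sketched in plain JavaScript. This is an illustrative approximation, not the Actor's actual source: the tag-stripping regexes and the 4-characters-per-token rule are assumptions, while the keyword checks mirror the heuristics listed in the Content Type Detection section.

```javascript
// Illustrative sketch of the pipeline steps (not the Actor's real code).

// Text Cleaning: strip <script>/<style> blocks and remaining tags,
// then normalize whitespace.
function cleanHtml(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Content Classification: keyword heuristics from the Content Type
// Detection section; "article" is the default fallback.
function detectContentType(title, text) {
  const haystack = `${title} ${text}`.toLowerCase();
  if (haystack.includes("faq") || haystack.includes("frequently asked")) return "faq";
  if (/how to|step|guide/.test(haystack)) return "guide";
  if (haystack.includes("documentation") || haystack.includes("api reference")) return "documentation";
  return "article";
}

// Token Estimation: ~4 characters per token is a common rough rule for
// English text; the Actor's exact formula is not documented here.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const html = "<article><script>track()</script><h1>Setup Guide</h1><p>How to install.</p></article>";
const content = cleanHtml(html);
const item = {
  content,
  contentType: detectContentType("Setup Guide", content),
  tokensEstimate: estimateTokens(content),
};
console.log(item);
// content: "Setup Guide How to install.", contentType: "guide", tokensEstimate: 7
```

In a real run the Actor stores one such item per page in the Apify Dataset, with the full field set listed in the Output section.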
Categories
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try AI Content Scraper & Cleaner now on Apify. A free tier is available, with no credit card required.
Actor Information
- Developer
- dashjeevanthedev
- Pricing
- Paid
- Total Runs
- 7
- Active Users
- 2
Related Actors
Google Search Results Scraper
by apify
Website Content Crawler
by apify
🔥 Leads Generator - $3/1k 50k leads like Apollo
by microworlds
Video Transcript Scraper: Youtube, X, Facebook, Tiktok, etc.
by invideoiq
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.