AI Content Scraper & Cleaner
by dashjeevanthedev
About AI Content Scraper & Cleaner
AI Content Scraper & Cleaner — Scrapes structured content (documentation, articles, FAQs, blog posts) and converts it into clean, normalized JSON datasets for LLM training. Extracts text, detects content types, estimates tokens, and removes boilerplate to produce ready-to-use training data.
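Concretely, each scraped page becomes one JSON record. A hypothetical example item, using the field names from the documentation below (the values are invented for illustration):

```json
{
  "url": "https://example.com/docs/getting-started",
  "title": "Getting Started",
  "content": "This guide walks you through installing and configuring...",
  "contentType": "guide",
  "tokensEstimate": 412,
  "language": "en",
  "extractedAt": "2025-01-15T09:30:00.000Z"
}
```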
What does this actor do?
AI Content Scraper & Cleaner is a web scraping and automation tool available on the Apify platform, designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
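For the API-access point above, runs can be started through Apify's REST API (`POST https://api.apify.com/v2/acts/{actorId}/runs?token={token}`). A minimal sketch that only assembles the request; the actor ID slug and token shown are placeholders, and error handling is omitted:

```javascript
// Minimal sketch, assuming the documented Apify v2 run endpoint.
// This helper only builds the request; pass its parts to fetch() to send it.
function buildActorRunRequest(actorId, token, input) {
  return {
    url: `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs?token=${token}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(input), // the actor's input object
    },
  };
}

// Placeholder actor slug and token, for illustration only.
const req = buildActorRunRequest(
  "dashjeevanthedev~ai-content-scraper-cleaner",
  "MY_TOKEN",
  { startUrls: "https://example.com/docs" }
);
console.log(req.url);
```

Sending `req.options` to `req.url` with `fetch` would start a run; results are then read from the run's default dataset.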
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# AI Content Scraper & Cleaner

An Apify Actor that scrapes structured content (documentation, articles, FAQs, blog posts) and automatically converts it into clean, normalized JSON datasets suitable for LLM training and fine-tuning.

## 🚀 Features

- Intelligent Content Extraction: Automatically extracts main content using configurable CSS selectors
- Content Type Detection: Automatically detects content types (FAQ, article, guide, documentation, blog)
- Text Cleaning: Removes HTML tags, scripts, and styles, and normalizes whitespace
- Token Estimation: Estimates token counts for LLM training (useful for dataset planning)
- Language Detection: Optional language filtering support
- Respectful Crawling: Honors robots.txt and implements rate limiting
- Proxy Support: Built-in Apify Proxy integration for reliable scraping
- Structured Output: Clean JSON dataset items with metadata

## 📋 Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrls` | string | - | Comma-separated list of URLs to start crawling from (required) |
| `maxRequestsPerCrawl` | string | `"50"` | Maximum number of requests allowed for this run |
| `contentSelectors` | string | `"article, .doc-content, .post-content"` | Comma-separated CSS selectors for main content extraction |
| `titleSelectors` | string | `"h1, .post-title"` | Comma-separated CSS selectors for title extraction |
| `minimumTextLength` | string | `"300"` | Ignore content shorter than this many characters |
| `contentType` | string | `"auto"` | Content type override (`auto`, `faq`, `article`, `guide`, `documentation`, `blog`, `other`) |
| `maxDepth` | string | `"2"` | Maximum link-following depth from the start URL |
| `respectRobotsTxt` | string | `"true"` | Whether to honor robots.txt rules |
| `useProxy` | string | `"true"` | Rotate proxies via Apify Proxy when available |
| `language` | string | `""` | Optional language code filter (e.g., `"en"`) |

## 📤 Output

The Actor outputs structured JSON dataset items with the following fields:

- `url`: Source URL of the scraped content
- `title`: Extracted page title
- `content`: Cleaned text content (HTML removed, normalized)
- `contentType`: Detected or specified content type
- `tokensEstimate`: Estimated token count for LLM training
- `language`: Detected or specified language code
- `extractedAt`: ISO timestamp of extraction

## 🛠️ Installation & Usage

### Prerequisites

- Node.js 20+ installed
- Apify CLI installed (Installation Guide)

### Local Development

1. Clone or navigate to the Actor directory:

   ```bash
   cd AI-Ready-Dataset
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Configure input: edit `input.json` with your target URLs:

   ```json
   {
     "startUrls": "https://example.com/docs, https://example.com/blog",
     "maxRequestsPerCrawl": "100",
     "minimumTextLength": "300"
   }
   ```

4. Run locally:

   ```bash
   apify run
   ```

5. View results: check `storage/datasets/default/` for the scraped data

### Deploy to Apify Cloud

1. Authenticate:

   ```bash
   apify login
   ```

2. Deploy:

   ```bash
   apify push
   ```

3. Run on Apify: use the Apify Console UI, or the CLI (`apify call <actor-id>`)

## 📝 Example Input

```json
{
  "startUrls": "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors",
  "maxRequestsPerCrawl": "50",
  "contentSelectors": "article, .doc-content, .post-content",
  "titleSelectors": "h1, .post-title",
  "minimumTextLength": "300",
  "contentType": "auto",
  "maxDepth": "2",
  "respectRobotsTxt": "true",
  "useProxy": "true",
  "language": "en"
}
```

## 🎯 Use Cases

- LLM Training Data Collection: Scrape documentation and articles for fine-tuning language models
- Knowledge Base Building: Extract structured content from documentation sites
- Content Analysis: Collect and analyze text content from multiple sources
- Dataset Creation: Build custom datasets for machine learning projects
- Content Migration: Extract content from websites for migration or archival

## 🔧 How It Works

1. URL Discovery: Starts from the provided URLs and follows links up to the specified depth
2. Content Extraction: Uses CSS selectors to extract main content and titles
3. Text Cleaning: Removes HTML, scripts, and styles, and normalizes whitespace
4. Content Classification: Automatically detects the content type using heuristics
5. Token Estimation: Calculates approximate token counts for LLM training
6. Data Storage: Saves cleaned, structured data to an Apify Dataset

## 📊 Content Type Detection

The Actor automatically detects content types using heuristics:

- FAQ: Contains "faq" or "frequently asked" keywords
- Guide: Contains "how to", "step", or "guide" keywords
- Documentation: Contains "documentation" or "api reference" keywords
- Article: Long-form content (>1000 words) or the default fallback

## ⚙️ Configuration Tips

### For Documentation Sites

```json
{
  "contentSelectors": "article, .doc-content, .documentation-content, main",
  "titleSelectors": "h1, .doc-title, .page-title"
}
```

### For Blog Sites

```json
{
  "contentSelectors": "article, .post-content, .entry-content, .blog-post",
  "titleSelectors": "h1, .post-title, .entry-title"
}
```

### For FAQ Pages

```json
{
  "contentSelectors": ".faq, .faq-item, .question-answer, article",
  "minimumTextLength": "100"
}
```

## 🚨 Important Notes

- Respect robots.txt: The Actor respects robots.txt by default; disable this only if you have permission
- Rate Limiting: Built-in delays prevent overloading target servers
- Content Filtering: Use `minimumTextLength` to filter out navigation and boilerplate
- Proxy Usage: Apify Proxy helps avoid IP blocking and rate limits

## 📚 Resources

- Apify Documentation
- Crawlee Documentation
- Apify Actor Development Guide

## 🤝 Contributing

This Actor follows Apify Actor best practices:

- Uses CheerioCrawler for fast static HTML scraping
- Implements proper error handling and retry logic
- Respects website terms and robots.txt
- Provides clean, structured output

## 📄 License

ISC

## 🔗 Links

- Actor on Apify: View on Apify Platform
- Apify CLI: Installation Guide

## 💡 Tips for Best Results

1. Start Small: Test with a few URLs first to verify that your selectors work
2. Adjust Selectors: Different sites need different CSS selectors, so customize as needed
3. Set Depth Carefully: Higher depth means more pages but a longer runtime
4. Filter by Length: Use `minimumTextLength` to avoid capturing navigation and headers
5. Monitor Progress: Check the Apify Console for real-time crawling progress

---

Built with ❤️ using the Apify SDK and Crawlee
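The cleaning, classification, and token-estimation steps described above can be sketched in plain JavaScript. This is an illustrative approximation, not the Actor's actual source: the tag-stripping regexes and the 4-characters-per-token rule are assumptions, while the keyword checks mirror the heuristics listed in the Content Type Detection section.

```javascript
// Illustrative sketch of the pipeline steps (not the Actor's real code).

// Text Cleaning: strip <script>/<style> blocks and remaining tags,
// then normalize whitespace.
function cleanHtml(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Content Classification: keyword heuristics from the Content Type
// Detection section; "article" is the default fallback.
function detectContentType(title, text) {
  const haystack = `${title} ${text}`.toLowerCase();
  if (haystack.includes("faq") || haystack.includes("frequently asked")) return "faq";
  if (/how to|step|guide/.test(haystack)) return "guide";
  if (haystack.includes("documentation") || haystack.includes("api reference")) return "documentation";
  return "article";
}

// Token Estimation: ~4 characters per token is a common rough rule for
// English text; the Actor's exact formula is not documented here.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const html = "<article><script>track()</script><h1>Setup Guide</h1><p>How to install.</p></article>";
const content = cleanHtml(html);
const item = {
  content,
  contentType: detectContentType("Setup Guide", content),
  tokensEstimate: estimateTokens(content),
};
console.log(item);
// content: "Setup Guide How to install.", contentType: "guide", tokensEstimate: 7
```

In a real run the Actor stores one such item per page in the Apify Dataset, with the full field set listed in the Output section.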
Categories
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try AI Content Scraper & Cleaner now on Apify. A free tier is available, with no credit card required.
Actor Information
- Developer
- dashjeevanthedev
- Pricing
- Paid
- Total Runs
- 7
- Active Users
- 2
Related Actors
Google Search Results Scraper
by apify
Website Content Crawler
by apify
🔥 Leads Generator - $3/1k 50k leads like Apollo
by microworlds
Video Transcript Scraper: Youtube, X, Facebook, Tiktok, etc.
by invideoiq
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.