AI Training Data Curator

by mea


About AI Training Data Curator

Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.

What does this actor do?

AI Training Data Curator is a web scraping and data curation tool on the Apify platform. It crawls websites (or processes documents you supply), extracts the main text, scores it for quality, removes near-duplicates, and exports the result in formats ready for LLM fine-tuning.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results (or start it from your own code via the API, as sketched below)
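
If you prefer to run the actor from your own code, it can also be started through the Apify API. The snippet below is a minimal sketch using the official apify-client Python package; the actor ID string and the input values are placeholders (assumptions, not verified identifiers), so copy the real actor ID from its Apify console page.

```python
# Minimal sketch using the official Apify client for Python (pip install apify-client).
# NOTE: "mea/ai-training-data-curator" is a placeholder -- use the actor's real ID.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_API_TOKEN>")

run_input = {
    "start_urls": [{"url": "https://docs.python.org/3/tutorial/"}],
    "crawl_mode": "same_subdomain",
    "max_pages": 20,  # start small before launching large crawls
    "output_format": "jsonl",
}

# Start the actor and wait for the run to finish.
run = client.actor("mea/ai-training-data-curator").call(run_input=run_input)

# Stream the curated documents from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("doc_id"), item.get("quality_score"), item.get("source_url"))
```

call() blocks until the run completes; for long crawls, scheduled runs or webhooks keep your script from having to wait.
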

Documentation

# AI Training Data Curator

Curate high-quality, deduplicated training data for LLM fine-tuning. Extract clean text from any website OR process your own documents with automatic quality scoring, deduplication, and format conversion.

## Features

- Smart Content Extraction: Automatically detects and extracts main content, filtering out navigation, ads, and boilerplate
- Bring Your Own Data (BYOD): Process your own text documents without crawling - perfect for existing datasets
- Quality Scoring: Scores each document based on vocabulary diversity, sentence structure, and content density
- Deduplication: Uses MinHash/Jaccard similarity to remove near-duplicate content
- Flexible Crawling: Single page, same domain, same subdomain, or follow all links
- Document Chunking: Split long documents into training-ready chunks with configurable overlap
- Multiple Output Formats: JSONL (OpenAI compatible), JSON, Parquet, CSV, or HuggingFace Datasets format
- Language Filtering: Filter content by language (ISO 639-1 codes)
- Privacy Features: Optionally remove emails and URLs from extracted text

## Use Cases

- LLM Fine-tuning: Collect domain-specific training data for fine-tuning language models
- RAG Systems: Build high-quality document collections for retrieval-augmented generation
- Knowledge Bases: Create clean text corpora from documentation sites
- Research: Gather datasets from academic or technical resources
- Data Cleaning: Clean and deduplicate existing text datasets for ML training

## Input Configuration

### Mode Selection

The actor supports two modes - provide either start_urls (for crawling) or documents (for BYOD):

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| start_urls | array | - | URLs to start crawling from (Crawl mode) |
| documents | array | - | Your own documents to process (BYOD mode) |

### BYOD (Bring Your Own Data) Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| documents | array | - | Array of text strings or objects with a text field |
| byod_text_field | string | text | Field name containing text in document objects |
| max_byod_documents | integer | 500 | Maximum documents to process (hard limit) |

### Crawl Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| start_urls | array | - | URLs to start crawling from |
| crawl_mode | string | same_domain | single_page, same_domain, same_subdomain, or all_links |
| max_pages | integer | 100 | Maximum pages to crawl |
| max_depth | integer | 3 | Maximum link depth from start URLs |

### Content Extraction

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| content_selectors | array | ["article", "main", ".content"] | CSS selectors for main content |
| exclude_selectors | array | ["nav", "header", "footer", ".sidebar"] | CSS selectors to exclude |
| min_word_count | integer | 100 | Minimum words per document |
| max_word_count | integer | 50000 | Maximum words per document |

### Quality & Deduplication

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| deduplicate | boolean | true | Remove duplicate/near-duplicate content |
| dedup_threshold | number | 0.85 | Similarity threshold (0.5-1.0) |
| quality_filter | boolean | true | Filter low-quality content |
| min_quality_score | number | 0.5 | Minimum quality score (0.0-1.0) |
| language_filter | array | ["en"] | Languages to include (ISO codes) |
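
To make the dedup_threshold setting concrete: the actor's deduplication is based on MinHash/Jaccard similarity, and the self-contained sketch below shows an exact (non-MinHash) version of the same idea using word shingles. It is illustrative only, not the actor's implementation.

```python
# Illustrative near-duplicate check with word shingles and exact Jaccard similarity.
# The actor uses MinHash (an approximation of Jaccard); this sketch only shows
# what a dedup_threshold of 0.85 means in practice.
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.85) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

print(is_near_duplicate(
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river shore.",
))
```

MinHash approximates this Jaccard comparison cheaply enough to run across thousands of documents; pairs at or above the threshold are treated as near-duplicates, and all but one are dropped.
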
### Output Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| output_format | string | jsonl | jsonl, json, parquet, csv, or huggingface |
| text_field_name | string | text | Name of the text field in output |
| include_metadata | boolean | true | Include URL, title, date metadata |
| include_raw_html | boolean | false | Also save original HTML |

### Chunking

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| chunk_documents | boolean | false | Split documents into chunks |
| chunk_size | integer | 512 | Target chunk size in tokens |
| chunk_overlap | integer | 64 | Overlap between chunks |

### Text Cleaning

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| clean_html | boolean | true | Remove HTML tags |
| normalize_whitespace | boolean | true | Collapse multiple spaces/newlines |
| remove_urls | boolean | false | Strip embedded URLs |
| remove_emails | boolean | true | Strip email addresses |

### Performance

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| use_proxies | boolean | false | Use residential proxies |
| max_concurrency | integer | 10 | Parallel requests |
| request_delay_ms | integer | 500 | Delay between requests |
| respect_robots_txt | boolean | true | Follow robots.txt rules |

## Output Format

Each document in the output contains:

```json
{
  "text": "The cleaned document text content...",
  "doc_id": "abc123def456",
  "source_url": "https://example.com/page",
  "word_count": 1523,
  "quality_score": 0.847,
  "language": "en",
  "title": "Page Title",
  "description": "Meta description",
  "content_type": "documentation",
  "scraped_at": "2024-01-15T10:30:00Z"
}
```

If chunking is enabled, additional fields are included:

```json
{
  "chunk_index": 0,
  "total_chunks": 5,
  "parent_doc_id": "abc123def456"
}
```

## Quality Metrics

The quality scorer evaluates documents based on:

- Word count: Penalizes very short documents
- Sentence length: Flags very short (fragments) or very long sentences
- Vocabulary diversity: Ratio of unique words to total words
- Boilerplate ratio: Detection of common web boilerplate patterns
- Character composition: Penalizes excessive uppercase, digits, or special characters

Documents with scores below min_quality_score are automatically filtered out.
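
As a rough illustration of how signals like these can be combined into a single number, here is a toy scorer in Python. The weights, thresholds, and exact formula are assumptions made for the example; the actor's internal scorer may differ.

```python
import re

def quality_score(text: str) -> float:
    """Toy quality heuristic combining a few of the signals listed above.
    Purely illustrative; not the actor's actual scoring code."""
    words = text.split()
    if not words:
        return 0.0

    # Word count: penalize very short documents.
    length_score = min(len(words) / 300.0, 1.0)

    # Vocabulary diversity: unique words over total words.
    diversity = len({w.lower() for w in words}) / len(words)

    # Sentence length: penalize fragments and run-ons.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = len(words) / max(len(sentences), 1)
    sentence_score = 1.0 if 8 <= avg_len <= 40 else 0.5

    # Character composition: penalize excessive uppercase or digits.
    noisy = sum(c.isupper() or c.isdigit() for c in text) / max(len(text), 1)
    composition_score = 1.0 - min(noisy * 2, 1.0)

    return round(
        0.3 * length_score + 0.3 * diversity +
        0.2 * sentence_score + 0.2 * composition_score, 3
    )

print(quality_score("A short fragment."))
print(quality_score("This paragraph has varied vocabulary, complete sentences, "
                    "and a reasonable length, so it should score higher."))
```
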
## Example Input

### Crawl Python Documentation

```json
{
  "start_urls": [{ "url": "https://docs.python.org/3/tutorial/" }],
  "crawl_mode": "same_subdomain",
  "max_pages": 500,
  "content_selectors": [".document", ".body"],
  "exclude_selectors": [".sphinxsidebar", ".related", "footer"],
  "output_format": "jsonl",
  "chunk_documents": true,
  "chunk_size": 1024
}
```

### Build Knowledge Base from Blog

```json
{
  "start_urls": [{ "url": "https://example.com/blog/" }],
  "crawl_mode": "same_domain",
  "max_pages": 100,
  "content_selectors": ["article", ".post-content"],
  "quality_filter": true,
  "min_quality_score": 0.6,
  "deduplicate": true,
  "output_format": "parquet"
}
```

### BYOD: Process Your Own Documents

```json
{
  "documents": [
    "This is a plain text document that will be processed...",
    {
      "text": "This document has metadata attached to it...",
      "source_id": "doc_001",
      "metadata": {
        "title": "My Document",
        "author": "John Doe",
        "language": "en"
      }
    }
  ],
  "deduplicate": true,
  "quality_filter": true,
  "min_quality_score": 0.5,
  "output_format": "jsonl"
}
```

### BYOD: Clean Existing Dataset

```json
{
  "documents": [
    {"text": "First document from your dataset..."},
    {"text": "Second document from your dataset..."},
    {"text": "Third document from your dataset..."}
  ],
  "byod_text_field": "text",
  "deduplicate": true,
  "dedup_threshold": 0.85,
  "chunk_documents": true,
  "chunk_size": 512,
  "output_format": "jsonl"
}
```

## Tips for Best Results

1. Use specific content selectors: Better extraction with precise CSS selectors for your target site
2. Set appropriate word counts: Filter out navigation pages and indexes with min_word_count
3. Enable deduplication: Prevents training on repetitive content (common on content farms)
4. Adjust quality threshold: Lower for technical content, higher for prose
5. Use chunking for long documents: Better for training context windows
6. Start small: Test with max_pages: 20 before large crawls

## Pricing

- $0.01 per document - charged for each cleaned document (both crawled and BYOD)

Additional costs:

- Proxy: ~$0.001-0.005 per request (if enabled)
- Storage: ~$0.0001 per document

## Support

- Apify Documentation
- Report Issues
- Crawlee Documentation
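
Once a run has finished and you have downloaded the JSONL output, it can be loaded directly into common training toolchains. The sketch below uses the Hugging Face datasets library and assumes the file was saved locally as curated_documents.jsonl (a placeholder name).

```python
# Load the actor's JSONL output into a Hugging Face dataset (pip install datasets).
from datasets import load_dataset

# "curated_documents.jsonl" is a placeholder for the file downloaded from the run.
dataset = load_dataset("json", data_files="curated_documents.jsonl", split="train")

# Apply your own quality bar on top of the actor's filtering, then keep only
# the text column for continued-pretraining-style fine-tuning.
dataset = dataset.filter(lambda doc: doc["quality_score"] >= 0.6)
dataset = dataset.remove_columns([c for c in dataset.column_names if c != "text"])

print(dataset)
```

From here the same Dataset object can be tokenized and passed to your training code, or exported back to JSONL or Parquet as needed.
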

Common Use Cases

LLM Fine-tuning

Collect domain-specific training data for fine-tuning language models

RAG Systems

Build high-quality document collections for retrieval-augmented generation

Knowledge Bases

Create clean text corpora from documentation sites

Data Cleaning

Clean and deduplicate existing text datasets for ML training

Ready to Get Started?

Try AI Training Data Curator now on Apify. Free tier available with no credit card required.

Start Free Trial

Actor Information

Developer
mea
Pricing
Paid
Total Runs
7
Active Users
2
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
