Ai Training Data Enricher
by fiery_dream
Automatically clean, enrich, and validate your LLM training datasets. Prepare production-grade data for AI fine-tuning and save hours of manual work.
About Ai Training Data Enricher
Ever feel like your AI model's training data is a bit of a mess? You're not alone. I've spent hours cleaning up datasets, and that's exactly why I built the Ai Training Data Enricher. It's my go-to for getting data ready for fine-tuning.

Think of it as a final quality check and upgrade for your datasets. It automatically finds and removes duplicate entries that can skew your model's learning, fills in missing information to enrich your data points, and runs validation to catch inconsistencies. This process turns a raw, noisy collection of data into a clean, reliable foundation.

I use it to prep data for chatbot training, agent simulations, and any project where the quality of the input directly determines the quality of the AI's output. It saves a massive amount of manual review time and gives me confidence that my models are learning from the best possible information. Skip the pre-training headache and get your data pipeline production-ready.
What does this actor do?
Ai Training Data Enricher is an actor available on the Apify platform. It cleans, enriches, and validates text datasets in the cloud before you use them for LLM fine-tuning.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
AI Training Data Enricher & Validator
Overview
An Apify actor for cleaning, enriching, and validating datasets before LLM fine-tuning. It processes your data to remove duplicates, detect sensitive information, and add analytical metadata, helping to prevent overfitting, privacy violations, and bias in trained models.
Key Features
Enrichment
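* Sentiment Analysis: Scores words from -5 (negative) to +5 (positive) using the AFINN lexicon and reports the summed score for the text, along with a comparative score normalized by word count.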
* Named Entity Recognition: Extracts people, places, organizations, dates, and values.
* Keyword Extraction: Identifies key terms using TF-IDF weighting.
* Language Detection: Detects text language with a confidence score.
* Readability Metrics: Calculates word count, sentence count, and complexity scores.
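To make the sentiment fields concrete, here is a minimal sketch of lexicon-based scoring in Python. It is not the actor's own code: the `TOY_LEXICON` table, the tokenizer, and the `score_sentiment` name are assumptions made for illustration, and a real AFINN lexicon contains thousands of entries.

```python
import re

# Tiny illustrative word-valence table (assumption: stands in for the full
# AFINN lexicon, which the actor is documented to use).
TOY_LEXICON = {"great": 3, "good": 3, "bad": -3, "terrible": -3}


def score_sentiment(text, lexicon=TOY_LEXICON):
    """Sum per-word valences and normalize the total by the token count."""
    # Lowercase and split into alphanumeric tokens (hypothetical tokenizer).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    score = sum(value for _, value in hits)
    return {
        "score": score,
        "comparative": score / len(tokens) if tokens else 0.0,
        "positive": [tok for tok, value in hits if value > 0],
        "negative": [tok for tok, value in hits if value < 0],
    }


if __name__ == "__main__":
    print(score_sentiment("Apple Inc. released iPhone in 2007. Great product!"))
```

With this tokenizer the sample sentence has eight tokens and one scored word, so the comparative value is the score divided by eight.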
Validation
* Duplicate Detection: Fuzzy string matching with a configurable similarity threshold (0.5 to 1.0).
* PII Detection: Finds emails, phone numbers, SSNs, and credit card numbers to support GDPR compliance.
* Schema Validation: Validates items against a JSON Schema.
* Length Filtering: Enforces minimum and maximum character limits.
* Quality Flags: Optional "flag-only" mode to mark invalid items without removing them.
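The duplicate check can be pictured with a short Python sketch. This is only an approximation under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure, which may differ from the fuzzy matcher the actor uses, and `flag_duplicates` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def flag_duplicates(texts, threshold=0.85):
    """Return one boolean per text: True if it is a near-duplicate of an
    earlier, kept text at or above the similarity threshold."""
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.5 and 1.0")
    kept = []   # normalized texts seen so far that were not duplicates
    flags = []
    for text in texts:
        # Normalize case and whitespace before comparing.
        norm = " ".join(text.lower().split())
        is_dup = any(
            SequenceMatcher(None, norm, earlier).ratio() >= threshold
            for earlier in kept
        )
        flags.append(is_dup)
        if not is_dup:
            kept.append(norm)
    return flags


if __name__ == "__main__":
    samples = [
        "The cat sat on the mat.",
        "The cat sat on the mat!",
        "Completely different sentence.",
    ]
    print(flag_duplicates(samples))
```

Comparing every item against all kept items is quadratic in the dataset size, so a sketch like this only suits small samples. A lower threshold flags more items as duplicates.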
Privacy & Compliance
* PII Redaction: Can automatically replace detected sensitive data with [REDACTED].
* Audit Trail: Provides a complete validation history for each item.
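As a rough illustration of detection followed by redaction, the Python sketch below replaces matches with `[REDACTED]`. The regular expressions and the `redact_pii` helper are assumptions for demonstration; they are deliberately simple, cover mainly US-style formats, and are not the patterns the actor ships with.

```python
import re

# Simplified, illustrative patterns (assumption: real-world PII detection
# needs broader and more carefully validated rules than these).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def redact_pii(text):
    """Return (redacted_text, kinds_found) for the given text."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(kind)
            text = pattern.sub("[REDACTED]", text)
    return text, found


if __name__ == "__main__":
    print(redact_pii("Mail jane.doe@example.com or call 555-123-4567."))
```

Returning the list of detected kinds alongside the text keeps a small record of what was changed, in the spirit of the audit trail described above.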
How to Use
Input Data Format
Prepare a dataset where each item contains at least a text field.
{
  "text": "Your training sample text here.",
  "label": "optional_label"
}
Basic Configuration
Run the actor with a configuration like this:
{
  "datasetId": "your-dataset-id",
  "textField": "text",
  "enrichmentOptions": {
    "sentiment": true,
    "entities": true,
    "keywords": true,
    "language": true,
    "readability": true
  },
  "validationOptions": {
    "detectDuplicates": true,
    "duplicateSimilarityThreshold": 0.85,
    "detectPII": true,
    "minTextLength": 10,
    "maxTextLength": 0
  },
  "outputOptions": {
    "includeOriginal": true,
    "flagOnly": false,
    "removePII": false
  }
}
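If you generate this configuration from a script, a small builder can catch an out-of-range threshold before the run starts. The Python sketch below is one possible approach: `build_run_input` is a hypothetical helper, the field names are copied from the configuration above, and the 0.5 to 1.0 range comes from the Duplicate Detection feature description.

```python
import json


def build_run_input(dataset_id, text_field="text", threshold=0.85,
                    flag_only=False, remove_pii=False):
    """Build an input dict shaped like the documented configuration."""
    # The documented similarity threshold range is 0.5 to 1.0.
    if not 0.5 <= threshold <= 1.0:
        raise ValueError(
            "duplicateSimilarityThreshold must be between 0.5 and 1.0"
        )
    return {
        "datasetId": dataset_id,
        "textField": text_field,
        "enrichmentOptions": {
            "sentiment": True,
            "entities": True,
            "keywords": True,
            "language": True,
            "readability": True,
        },
        "validationOptions": {
            "detectDuplicates": True,
            "duplicateSimilarityThreshold": threshold,
            "detectPII": True,
            "minTextLength": 10,
            "maxTextLength": 0,
        },
        "outputOptions": {
            "includeOriginal": True,
            "flagOnly": flag_only,
            "removePII": remove_pii,
        },
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the actor's input editor.
    print(json.dumps(build_run_input("your-dataset-id"), indent=2))
```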
Input / Output
Input: Your source dataset via an Apify dataset ID.
Output: A new dataset with enriched and validated items. Each output item has the following structure:
{
  "id": 0,
  "originalText": "Apple Inc. released iPhone in 2007. Great product!",
  "enrichment": {
    "sentiment": {
      "score": 3,
      "comparative": 0.375,
      "positive": ["great"],
      "negative": []
    },
    "entities": {
      "organizations": ["Apple Inc."],
      "dates": ["2007"]
    },
    "keywords": ["apple", "iphone", "released", "product"],
    "language": "english",
    "readability": {
      "wordCount": 8,
      "sentenceCount": 2
    }
  },
  "validation": {
    "isValid": true,
    "isDuplicate": false,
    "hasPII": false,
    "lengthValid": true,
    "schemaValid": true
  }
}
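When the actor runs in flag-only mode, filtering is left to you. The following Python sketch shows one way to keep only clean items and write them as JSONL. The field names follow the output structure above; the `select_clean_items` and `to_jsonl` helpers and the one-`text`-field-per-line layout are assumptions, so adjust them to whatever format your fine-tuning tool expects.

```python
import json


def select_clean_items(items):
    """Keep items that are valid, not duplicates, and free of detected PII."""
    clean = []
    for item in items:
        validation = item.get("validation", {})
        if (validation.get("isValid")
                and not validation.get("isDuplicate")
                and not validation.get("hasPII")):
            clean.append(item)
    return clean


def to_jsonl(items):
    """Serialize items as JSON Lines with a single "text" field per row."""
    return "\n".join(
        json.dumps({"text": item["originalText"]}) for item in items
    )


if __name__ == "__main__":
    sample = [
        {"id": 0, "originalText": "Great product!",
         "validation": {"isValid": True, "isDuplicate": False, "hasPII": False}},
        {"id": 1, "originalText": "Great product!!",
         "validation": {"isValid": False, "isDuplicate": True, "hasPII": False}},
    ]
    print(to_jsonl(select_clean_items(sample)))
```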
Actor Information
- Developer: fiery_dream
- Pricing: Paid
- Total Runs: 17
- Active Users: 2