Ai Training Data Enricher

Name: Ai Training Data Enricher
Author: fiery_dream

Automatically clean, enrich, and validate your LLM training datasets. Prepare production-grade data for AI fine-tuning and save hours of manual work.

17 runs

2 users

About Ai Training Data Enricher

Ever feel like your AI model's training data is a bit of a mess? You're not alone. I've spent hours cleaning up datasets, and that's exactly why I built the Ai Training Data Enricher. It's my go-to for getting data ready for fine-tuning. Think of it as a final quality check and upgrade for your datasets. It automatically finds and removes duplicate entries that can skew your model's learning, fills in missing information to enrich your data points, and runs validation to catch inconsistencies. This process turns a raw, noisy collection of data into a clean, reliable foundation.

I use it to prep data for chatbot training, agent simulations, and any project where the quality of the input directly determines the quality of the AI's output. It saves a massive amount of manual review time and gives me confidence that my models are learning from the best possible information. Skip the pre-training headache and get your data pipeline production-ready.

What does this actor do?

Ai Training Data Enricher is an actor on the Apify platform that cleans, enriches, and validates datasets before LLM fine-tuning. It runs in the cloud, so no local setup is needed to process your data.

Key Features

* Cloud-based execution - no local setup required
* Scalable infrastructure for large-scale operations
* API access for integration with your applications
* Built-in proxy rotation and anti-blocking measures
* Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

AI Training Data Enricher & Validator

Overview

An Apify actor for cleaning, enriching, and validating datasets before LLM fine-tuning. It processes your data to remove duplicates, detect sensitive information, and add analytical metadata, helping to prevent overfitting, privacy violations, and bias in trained models.

Key Features

Enrichment
* Sentiment Analysis: Scores text with the AFINN lexicon, which rates individual words from -5 (negative) to +5 (positive).
* Named Entity Recognition: Extracts people, places, organizations, dates, and values.
* Keyword Extraction: Identifies key terms using TF-IDF weighting.
* Language Detection: Detects text language with a confidence score.
* Readability Metrics: Calculates word count, sentence count, and complexity scores.
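To make the sentiment fields concrete, here is a minimal sketch of lexicon-based scoring in Python. It is not the actor's code: the `WORD_SCORES` table is a tiny illustrative stand-in for the AFINN lexicon, and `score_sentiment` and its tokenizer are hypothetical names chosen for this example. It assumes the score is the sum of word ratings and the comparative value is that sum divided by the token count, which is consistent with the sample output shown later in this document.

```python
import re

# Illustrative word ratings on the AFINN scale (-5 to +5).
# Assumption: a tiny stand-in table, not the full lexicon.
WORD_SCORES = {"great": 3, "good": 3, "bad": -3, "terrible": -3}


def score_sentiment(text: str) -> dict:
    """Sum per-word ratings and normalise the total by token count."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    score = sum(WORD_SCORES.get(token, 0) for token in tokens)
    return {
        "score": score,
        # Guard against empty input to avoid division by zero.
        "comparative": score / len(tokens) if tokens else 0.0,
        "positive": [t for t in tokens if WORD_SCORES.get(t, 0) > 0],
        "negative": [t for t in tokens if WORD_SCORES.get(t, 0) < 0],
    }


if __name__ == "__main__":
    print(score_sentiment("Apple Inc. released iPhone in 2007. Great product!"))
```

With this simplified tokenizer the sample sentence yields eight tokens, one of which ("great") carries a rating, so the comparative value is 3 divided by 8.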

Validation
* Duplicate Detection: Fuzzy string matching with a configurable similarity threshold (0.5 to 1.0).
* PII Detection: Finds emails, phone numbers, SSNs, and credit card numbers to support GDPR compliance.
* Schema Validation: Validates items against a JSON Schema.
* Length Filtering: Enforces minimum and maximum character limits.
* Quality Flags: Optional "flag-only" mode to mark invalid items without removing them.
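The duplicate check can be pictured with a short Python sketch. The actor's actual matching algorithm is not documented here, so this example swaps in `difflib.SequenceMatcher` from the standard library as one possible similarity measure. The function names are hypothetical, and the only detail taken from the feature list is the threshold range of 0.5 to 1.0.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def flag_duplicates(texts, threshold=0.85):
    """Return one boolean per text: True if it resembles an earlier kept text.

    Assumption: similarity is SequenceMatcher.ratio(); the actor may use
    a different metric. Comparison is pairwise, so cost grows quadratically.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.5 and 1.0")
    kept = []
    flags = []
    for text in texts:
        candidate = normalize(text)
        is_duplicate = any(
            SequenceMatcher(None, candidate, seen).ratio() >= threshold
            for seen in kept
        )
        flags.append(is_duplicate)
        if not is_duplicate:
            kept.append(candidate)
    return flags


if __name__ == "__main__":
    samples = [
        "The cat sat on the mat.",
        "The cat sat on the mat!",
        "Completely different sentence.",
    ]
    print(flag_duplicates(samples))
```

A lower threshold catches looser paraphrases at the risk of flagging distinct samples; a threshold of 1.0 only flags texts that are identical after normalization.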

Privacy & Compliance
* PII Redaction: Can automatically replace detected sensitive data with [REDACTED].
* Audit Trail: Provides a complete validation history for each item.
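As a rough illustration of detection followed by redaction, the Python sketch below uses two deliberately simplified regular expressions. The patterns, the `PII_PATTERNS` table, and the `redact_pii` function are assumptions made for this example; the actor's real detectors are not shown in this document and are likely more thorough.

```python
import re

# Simplified patterns for illustration only (assumptions, not the actor's rules).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str):
    """Replace each match with [REDACTED] and report which PII kinds were seen."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn("[REDACTED]", text)
        if count:
            found.append(kind)
    return text, found


if __name__ == "__main__":
    redacted, kinds = redact_pii("Contact jane@example.com or 123-45-6789.")
    print(redacted)
    print(kinds)
```

Returning the list of detected kinds alongside the redacted text is one simple way to keep a per-item record of what was changed.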

How to Use

Input Data Format

Prepare a dataset where each item contains at least a text field.

{
  "text": "Your training sample text here.",
  "label": "optional_label"
}
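Before uploading a dataset, it can help to confirm that every item really has a usable text field. The following Python sketch is a hypothetical pre-flight check, not part of the actor; `find_invalid_items` is a name invented for this example.

```python
def find_invalid_items(items, text_field="text"):
    """Return the indexes of items lacking a non-empty string in text_field."""
    invalid = []
    for index, item in enumerate(items):
        value = item.get(text_field)
        # Reject missing fields, non-string values, and whitespace-only text.
        if not isinstance(value, str) or not value.strip():
            invalid.append(index)
    return invalid


if __name__ == "__main__":
    dataset = [
        {"text": "Your training sample text here.", "label": "optional_label"},
        {"label": "missing_text"},
        {"text": "   "},
    ]
    print(find_invalid_items(dataset))
```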

Basic Configuration

Run the actor with a configuration like this:

{
  "datasetId": "your-dataset-id",
  "textField": "text",
  "enrichmentOptions": {
    "sentiment": true,
    "entities": true,
    "keywords": true,
    "language": true,
    "readability": true
  },
  "validationOptions": {
    "detectDuplicates": true,
    "duplicateSimilarityThreshold": 0.85,
    "detectPII": true,
    "minTextLength": 10,
    "maxTextLength": 0
  },
  "outputOptions": {
    "includeOriginal": true,
    "flagOnly": false,
    "removePII": false
  }
}
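If the configuration is generated from code, a small helper can catch out-of-range values before a run starts. This Python sketch builds the same structure as the sample above; `build_run_input` and its parameters are hypothetical, and the only rule it enforces is the 0.5 to 1.0 similarity range stated in the feature list.

```python
import json


def build_run_input(dataset_id, text_field="text", similarity_threshold=0.85,
                    min_text_length=10, flag_only=False, remove_pii=False):
    """Assemble an input object shaped like the sample configuration."""
    if not 0.5 <= similarity_threshold <= 1.0:
        raise ValueError("duplicateSimilarityThreshold must be between 0.5 and 1.0")
    return {
        "datasetId": dataset_id,
        "textField": text_field,
        "enrichmentOptions": {
            "sentiment": True,
            "entities": True,
            "keywords": True,
            "language": True,
            "readability": True,
        },
        "validationOptions": {
            "detectDuplicates": True,
            "duplicateSimilarityThreshold": similarity_threshold,
            "detectPII": True,
            "minTextLength": min_text_length,
            # Value copied unchanged from the sample configuration.
            "maxTextLength": 0,
        },
        "outputOptions": {
            "includeOriginal": True,
            "flagOnly": flag_only,
            "removePII": remove_pii,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_run_input("your-dataset-id"), indent=2))
```

The resulting dictionary can be serialized to JSON and pasted into the actor's input form, or passed as the run input when starting the actor through the Apify API.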

Input / Output

Input: Your source dataset via an Apify dataset ID.

Output: A new dataset with enriched and validated items. Each output item has the following structure:

{
  "id": 0,
  "originalText": "Apple Inc. released iPhone in 2007. Great product!",
  "enrichment": {
    "sentiment": {
      "score": 3,
      "comparative": 0.375,
      "positive": ["great"],
      "negative": []
    },
    "entities": {
      "organizations": ["Apple Inc."],
      "dates": ["2007"]
    },
    "keywords": ["apple", "iphone", "released", "product"],
    "language": "english",
    "readability": {
      "wordCount": 8,
      "sentenceCount": 2
    }
  },
  "validation": {
    "isValid": true,
    "isDuplicate": false,
    "hasPII": false,
    "lengthValid": true,
    "schemaValid": true
  }
}
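Once the output dataset has been downloaded, the validation flags make it straightforward to pick out the items worth training on. The Python sketch below assumes the item structure shown above and that `originalText` is present; `select_clean_texts` is a hypothetical helper, not something the actor provides.

```python
def select_clean_texts(items):
    """Keep the text of items that are valid, not duplicates, and free of PII."""
    clean = []
    for item in items:
        validation = item.get("validation", {})
        text = item.get("originalText")
        if text is None:
            continue  # the original text was not included in the output
        if (validation.get("isValid")
                and not validation.get("isDuplicate")
                and not validation.get("hasPII")):
            clean.append(text)
    return clean


if __name__ == "__main__":
    output_items = [
        {"id": 0, "originalText": "Apple Inc. released iPhone in 2007. Great product!",
         "validation": {"isValid": True, "isDuplicate": False, "hasPII": False}},
        {"id": 1, "originalText": "Mail me at jane@example.com",
         "validation": {"isValid": False, "isDuplicate": False, "hasPII": True}},
    ]
    print(select_clean_texts(output_items))
```

This kind of filter is mainly useful in flag-only mode, where invalid items are marked rather than removed.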

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Ai Training Data Enricher now on Apify. Free tier available with no credit card required.

Actor Information

Developer: fiery_dream
Pricing: Paid
Total Runs: 17
Active Users: 2
