Ai Training Data Enricher

by fiery_dream

Automatically clean, enrich, and validate your LLM training datasets. Prepare production-grade data for AI fine-tuning and save hours of manual work.

17 runs
2 users

About Ai Training Data Enricher

Ever feel like your AI model's training data is a bit of a mess? You're not alone. I've spent hours cleaning up datasets, and that's exactly why I built the Ai Training Data Enricher. It's my go-to for getting data ready for fine-tuning. Think of it as a final quality check and upgrade for your datasets. It automatically finds and removes duplicate entries that can skew your model's learning, enriches each data point with metadata such as sentiment, entities, and keywords, and runs validation to catch inconsistencies. This process turns a raw, noisy collection of data into a clean, reliable foundation.

I use it to prep data for chatbot training, agent simulations, and any project where the quality of the input directly determines the quality of the AI's output. It saves hours of manual review and gives me confidence that my models are learning from clean data. Skip the pre-training headache and get your data pipeline production-ready.

What does this actor do?

Ai Training Data Enricher is a data-processing actor available on the Apify platform. It's designed to clean, enrich, and validate text datasets in the cloud before you use them for LLM fine-tuning.

Platform Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

AI Training Data Enricher & Validator

License: MIT

Overview

An Apify actor for cleaning, enriching, and validating datasets before LLM fine-tuning. It processes your data to remove duplicates, detect sensitive information, and add analytical metadata, helping to prevent overfitting, privacy violations, and bias in trained models.

Key Features

Enrichment
* Sentiment Analysis: Rates words from -5 (negative) to +5 (positive) using the AFINN lexicon and reports a summed score plus a per-word comparative score.
* Named Entity Recognition: Extracts people, places, organizations, dates, and values.
* Keyword Extraction: Identifies key terms using TF-IDF weighting.
* Language Detection: Detects text language with a confidence score.
* Readability Metrics: Calculates word count, sentence count, and complexity scores.
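
To make the sentiment fields concrete, here is a minimal Python sketch of AFINN-style scoring. It is not the actor's implementation: the four-word lexicon, the tokenizer, and the function name score_sentiment are assumptions made for illustration, and a real run would use the full AFINN word list.

```python
import re

# Assumption: a tiny illustrative subset of AFINN-style valences,
# not the full lexicon the actor would use.
TOY_LEXICON = {"great": 3, "good": 3, "bad": -3, "terrible": -3}

def score_sentiment(text, lexicon=TOY_LEXICON):
    """Sum word valences and normalize by token count."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    score = sum(value for _, value in hits)
    return {
        "score": score,
        # comparative = total score divided by the number of tokens
        "comparative": score / len(tokens) if tokens else 0.0,
        "positive": [tok for tok, value in hits if value > 0],
        "negative": [tok for tok, value in hits if value < 0],
    }

result = score_sentiment("Apple Inc. released iPhone in 2007. Great product!")
print(result)
```

With this toy lexicon the sample sentence has eight tokens and one scored word, so the comparative value is the total score divided by eight.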

Validation
* Duplicate Detection: Fuzzy string matching with a configurable similarity threshold (0.5 to 1.0).
* PII Detection: Finds emails, phone numbers, SSNs, and credit card numbers to support GDPR compliance.
* Schema Validation: Validates items against a JSON Schema.
* Length Filtering: Enforces minimum and maximum character limits.
* Quality Flags: Optional "flag-only" mode to mark invalid items without removing them.
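
The duplicate, length, and flag-only checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actor's code: it uses difflib.SequenceMatcher from the standard library as the similarity measure (the actor's actual fuzzy-matching algorithm is not documented here), it treats a max_len of 0 as "no upper limit", and the function name validate_items is hypothetical.

```python
from difflib import SequenceMatcher

def validate_items(texts, threshold=0.85, min_len=10, max_len=0, flag_only=False):
    """Flag near-duplicates and length violations.

    Assumptions: similarity is difflib's ratio, and max_len == 0
    means no maximum length is enforced.
    """
    kept = []      # normalized texts already accepted as valid
    results = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        is_duplicate = any(
            SequenceMatcher(None, normalized, seen).ratio() >= threshold
            for seen in kept
        )
        length_valid = len(text) >= min_len and (max_len == 0 or len(text) <= max_len)
        is_valid = length_valid and not is_duplicate
        if is_valid:
            kept.append(normalized)
        if is_valid or flag_only:
            # In flag-only mode invalid items are kept but marked.
            results.append({
                "text": text,
                "isDuplicate": is_duplicate,
                "lengthValid": length_valid,
                "isValid": is_valid,
            })
    return results

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumped over the lazy dog.",
    "Too short",
    "A completely different training sample.",
]
cleaned = validate_items(samples)
flagged = validate_items(samples, flag_only=True)
```

Here cleaned drops the near-duplicate and the too-short sample, while flagged keeps all four items and marks the invalid ones. Comparing every item against every kept item is quadratic, so a large dataset would need a cheaper candidate search.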

Privacy & Compliance
* PII Redaction: Can automatically replace detected sensitive data with [REDACTED].
* Audit Trail: Provides a complete validation history for each item.
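
As a rough picture of PII detection and redaction, the sketch below scans text with a few regular expressions and optionally replaces matches with [REDACTED]. The patterns, the piiTypes field, and the function name scan_pii are assumptions for illustration. They are deliberately simple (US-style phone and SSN formats, 16-digit card numbers) and will miss many real-world formats.

```python
import re

# Assumption: simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "creditCard": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}

def scan_pii(text, redact=False):
    """Report which PII types appear and optionally redact them."""
    found = []
    cleaned = text
    for name, pattern in PII_PATTERNS.items():
        # Substitute as we go so later patterns cannot re-match redacted spans.
        cleaned, count = pattern.subn("[REDACTED]", cleaned)
        if count:
            found.append(name)
    return {
        "hasPII": bool(found),
        "piiTypes": found,
        "text": cleaned if redact else text,
    }

sample = "Contact jane.doe@example.com or 555-123-4567 about SSN 123-45-6789."
report = scan_pii(sample, redact=True)
print(report["text"])
```

Calling scan_pii with redact=False leaves the text untouched and only reports the findings.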

How to Use

Input Data Format

Prepare a dataset where each item contains at least a text field.

{
  "text": "Your training sample text here.",
  "label": "optional_label"
}

Basic Configuration

Run the actor with a configuration like this:

{
  "datasetId": "your-dataset-id",
  "textField": "text",
  "enrichmentOptions": {
    "sentiment": true,
    "entities": true,
    "keywords": true,
    "language": true,
    "readability": true
  },
  "validationOptions": {
    "detectDuplicates": true,
    "duplicateSimilarityThreshold": 0.85,
    "detectPII": true,
    "minTextLength": 10,
    "maxTextLength": 0
  },
  "outputOptions": {
    "includeOriginal": true,
    "flagOnly": false,
    "removePII": false
  }
}
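
The same configuration can be assembled and submitted from Python. The sketch below assumes the apify-client package is installed and that an API token is available in an APIFY_TOKEN environment variable. The helper names build_run_input and run_enricher are hypothetical, and the actor ID must be copied from the actor's page on Apify since it is not given here.

```python
import os

def build_run_input(dataset_id, threshold=0.85, redact_pii=False):
    """Build the actor input shown above, checking the documented threshold range."""
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("duplicateSimilarityThreshold must be between 0.5 and 1.0")
    return {
        "datasetId": dataset_id,
        "textField": "text",
        "enrichmentOptions": {
            "sentiment": True, "entities": True, "keywords": True,
            "language": True, "readability": True,
        },
        "validationOptions": {
            "detectDuplicates": True,
            "duplicateSimilarityThreshold": threshold,
            "detectPII": True,
            "minTextLength": 10,
            "maxTextLength": 0,
        },
        "outputOptions": {
            "includeOriginal": True,
            "flagOnly": False,
            "removePII": redact_pii,
        },
    }

def run_enricher(actor_id, run_input):
    """Start the actor and return its output items (needs network and a token)."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

run_input = build_run_input("your-dataset-id")
# items = run_enricher("<username>/<actor-name>", run_input)  # placeholder actor ID
```

Validating the threshold locally catches an out-of-range value before a paid run is started.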

Input / Output

Input: Your source dataset via an Apify dataset ID.

Output: A new dataset with enriched and validated items. Each output item has the following structure:

{
  "id": 0,
  "originalText": "Apple Inc. released iPhone in 2007. Great product!",
  "enrichment": {
    "sentiment": {
      "score": 3,
      "comparative": 0.375,
      "positive": ["great"],
      "negative": []
    },
    "entities": {
      "organizations": ["Apple Inc."],
      "dates": ["2007"]
    },
    "keywords": ["apple", "iphone", "released", "product"],
    "language": "english",
    "readability": {
      "wordCount": 8,
      "sentenceCount": 2
    }
  },
  "validation": {
    "isValid": true,
    "isDuplicate": false,
    "hasPII": false,
    "lengthValid": true,
    "schemaValid": true
  }
}
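
Once a run finishes, the validation flags can drive a final filter before fine-tuning. The following sketch assumes output items shaped like the example above, with includeOriginal enabled so that originalText is present. The function name select_training_rows and the JSONL layout are illustrative choices, not something the actor prescribes.

```python
import json

def select_training_rows(items):
    """Keep only items that passed validation, reduced to the fields needed for training."""
    rows = []
    for item in items:
        validation = item.get("validation", {})
        if not validation.get("isValid"):
            continue
        if validation.get("isDuplicate") or validation.get("hasPII"):
            continue
        rows.append({
            "text": item["originalText"],
            "language": item.get("enrichment", {}).get("language"),
        })
    return rows

example_items = [
    {
        "id": 0,
        "originalText": "Apple Inc. released iPhone in 2007. Great product!",
        "enrichment": {"language": "english"},
        "validation": {"isValid": True, "isDuplicate": False, "hasPII": False},
    },
    {
        "id": 1,
        "originalText": "Apple Inc. released iPhone in 2007. Great product!!",
        "enrichment": {"language": "english"},
        "validation": {"isValid": False, "isDuplicate": True, "hasPII": False},
    },
]

rows = select_training_rows(example_items)
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)
```

Writing one JSON object per line keeps the file easy to stream into most fine-tuning tools.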

Common Use Cases

Chatbot Training

Clean and deduplicate conversational samples before fine-tuning

Agent Simulations

Prepare validated, enriched text for agent simulation projects

Privacy Review

Detect and optionally redact PII before data reaches a model

Dataset Quality Audits

Flag invalid items without removing them and review the validation results

Actor Information

Developer: fiery_dream
Pricing: Paid
Total Runs: 17
Active Users: 2

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
