Dataset Quality Scorer

by fiery_dream

Score and improve your ML datasets. This tool checks for completeness, consistency, duplicates, balance, data drift, and outliers, then recommends fixes.

14 runs
2 users

About Dataset Quality Scorer

Ever feel like your machine learning models are only as good as the data you feed them? You're right. The Dataset Quality Scorer is the actor I run to sanity-check my datasets before they go anywhere near a model. It doesn't just run basic stats; it digs into what actually matters for training.

It scores your data on critical dimensions like completeness (are there missing values?), consistency (is the formatting a mess?), and balance (is one class dominating?). It hunts down sneaky duplicates that can skew your results and flags outliers that might be errors or critical edge cases. One of its best features is monitoring for data drift, which helps you catch when your live data starts to diverge from your training set, a classic reason models degrade over time.

After the analysis, it doesn't just leave you with a score. It gives you clear, actionable recommendations on how to fix the issues it finds. I use it as a final checkpoint to clean my data, validate new data sources, and maintain the health of models in production. It turns a vague worry about data quality into a clear report card and a to-do list.

Documentation

Dataset Quality Scorer

An Apify actor that automatically scores and diagnoses the quality of machine learning datasets. It checks for common data issues, detects drift and outliers, and generates actionable reports to improve data health before model training.

Key Features

  • Quality Scoring: Calculates a comprehensive score based on completeness, consistency, duplicates, and class balance.
  • Data Drift Detection: Compares a current dataset against a baseline to identify significant distribution shifts.
  • Outlier Detection: Finds anomalies using multiple methods (Z-score, IQR, Isolation Forest).
  • Schema Validation: Ensures data matches an expected structure, checking column presence and data types.
  • Full Report Generation: Runs all checks and combines results into a single, detailed analysis.
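To make the scoring dimensions above concrete, here is a minimal Python sketch of three of them: completeness, exact duplicates, and class balance. It is an illustration under stated assumptions, not the actor's implementation: the function names, the row-as-dictionary layout, and the exact formulas are all hypothetical, since the documentation does not say how the score is computed.

```python
from collections import Counter


def completeness(rows, columns):
    """Fraction of cells that are present (not None and not an empty string)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for row in rows
        for col in columns
        if row.get(col) is not None and row.get(col) != ""
    )
    return filled / total


def duplicate_ratio(rows):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = set()
    dupes = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # assumes all cell values are hashable
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(rows)


def balance_ratio(rows, target_column):
    """Smallest class count divided by the largest; 1.0 means perfectly balanced."""
    counts = Counter(
        row[target_column] for row in rows if row.get(target_column) is not None
    )
    if not counts:
        return 0.0
    return min(counts.values()) / max(counts.values())


# Hypothetical sample data, using the column names from the example input.
sample_rows = [
    {"id": 1, "feature1": 0.5, "label": "a"},
    {"id": 2, "feature1": None, "label": "a"},
    {"id": 2, "feature1": None, "label": "a"},
    {"id": 3, "feature1": 1.5, "label": "b"},
]
print(completeness(sample_rows, ["id", "feature1", "label"]))
print(duplicate_ratio(sample_rows))
print(balance_ratio(sample_rows, "label"))
```

How these individual ratios are weighted into a single quality score is not documented, so the sketch stops at the per-dimension values.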

How to Use

Choose an operation and provide your dataset via datasetUrl (a URL or Apify dataset ID) or inline as datasetData. Configure the specific checks you need.
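As a rough sketch of assembling that input programmatically, the helper below builds the input object in Python. The helper name build_input is hypothetical, and the rule that exactly one of datasetUrl and datasetData must be given is an assumption; the documentation only says the dataset can be supplied either way.

```python
import json

# Operation names as listed in this documentation.
VALID_OPERATIONS = {
    "scoreQuality",
    "detectDrift",
    "findOutliers",
    "validateSchema",
    "generateReport",
}


def build_input(operation, dataset_url=None, dataset_data=None, **options):
    """Build an input dictionary for the actor (hypothetical helper).

    Extra keyword arguments (for example checkDuplicates=True) are copied
    into the payload unchanged.
    """
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    # Assumption: exactly one dataset source should be provided.
    if (dataset_url is None) == (dataset_data is None):
        raise ValueError("provide exactly one of dataset_url or dataset_data")

    payload = {"operation": operation}
    if dataset_url is not None:
        payload["datasetUrl"] = dataset_url
    else:
        payload["datasetData"] = dataset_data
    payload.update(options)
    return payload


run_input = build_input(
    "scoreQuality",
    dataset_url="https://example.com/data.csv",
    checkDuplicates=True,
    checkBalance=True,
    targetColumn="label",
)
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted into the actor's input form as JSON or passed as the run input when starting the actor through the Apify API.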

Common Operations & Parameters

  • scoreQuality: Scores the dataset. Use checkDuplicates and checkBalance flags; specify targetColumn for balance analysis.
  • detectDrift: Compares against a baselineDataset to find distribution changes.
  • findOutliers: Detects anomalies. Choose an outlierMethod (zscore, iqr, isolation_forest) and set the outlierThreshold.
  • validateSchema: Validates data structure against a provided schemaDefinition.
  • generateReport: Runs the full suite of checks (recommended).
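The documentation does not say which statistic detectDrift uses to compare a dataset against its baseline. One common choice for a numeric column is the two-sample Kolmogorov-Smirnov statistic, sketched below in plain Python. The function names and the 0.2 cutoff are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the two
    samples: 0.0 for identical distributions, 1.0 for fully separated ones.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    a = sorted(baseline)
    b = sorted(current)
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observed value is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def column_drifted(baseline_rows, current_rows, column, cutoff=0.2):
    """Flag drift in one numeric column (the cutoff is an illustrative assumption)."""
    base = [r[column] for r in baseline_rows if r.get(column) is not None]
    curr = [r[column] for r in current_rows if r.get(column) is not None]
    return ks_statistic(base, curr) > cutoff


baseline = [{"feature1": v} for v in (1, 2, 3, 4)]
current = [{"feature1": v} for v in (3, 4, 5, 6)]
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
print(column_drifted(baseline, current, "feature1"))  # True
```

Categorical columns need a different comparison, such as comparing class frequencies, which this sketch does not cover.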

Example Input

To run a full report with schema validation and outlier detection:

{
  "operation": "generateReport",
  "datasetUrl": "https://example.com/data.csv",
  "schemaDefinition": {
    "columns": {
      "id": "number",
      "feature1": "number",
      "label": "string"
    },
    "required": ["id", "label"]
  },
  "checkDuplicates": true,
  "checkBalance": true,
  "targetColumn": "label",
  "outlierMethod": "zscore",
  "outlierThreshold": 3
}
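To show what a schemaDefinition like the one above expresses, here is a small Python sketch that checks rows against it. This is an assumption about the semantics, not the actor's code: it treats "number" and "string" (the only type names shown in the example) as Python numeric and str values, and treats a required column as one that must be present and non-null in every row.

```python
from numbers import Number

# Only the two type names that appear in the example input are supported here.
TYPE_CHECKS = {
    "number": lambda v: isinstance(v, Number) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def validate_rows(rows, schema):
    """Return a list of problems; an empty list means every row conforms."""
    columns = schema.get("columns", {})
    required = schema.get("required", [])

    unknown = sorted(set(columns.values()) - set(TYPE_CHECKS))
    if unknown:
        raise ValueError(f"unsupported type names in this sketch: {unknown}")

    problems = []
    for index, row in enumerate(rows):
        for name in required:
            if row.get(name) is None:
                problems.append(f"row {index}: missing required column '{name}'")
        for name, type_name in columns.items():
            value = row.get(name)
            if value is None:
                continue  # absence is reported by the required check above
            if not TYPE_CHECKS[type_name](value):
                problems.append(
                    f"row {index}: column '{name}' expected {type_name}, "
                    f"got {type(value).__name__}"
                )
    return problems


schema = {
    "columns": {"id": "number", "feature1": "number", "label": "string"},
    "required": ["id", "label"],
}
rows = [
    {"id": 1, "feature1": 0.5, "label": "cat"},
    {"id": "2", "feature1": 1.0},  # wrong id type and no label
]
for problem in validate_rows(rows, schema):
    print(problem)
```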

Input/Output

Main Input Parameters:

  • operation (required): The analysis to run (scoreQuality, detectDrift, findOutliers, validateSchema, generateReport).
  • datasetUrl / datasetData: The dataset to analyze, given as a URL or Apify dataset ID (datasetUrl) or inline (datasetData).
  • schemaDefinition: Expected structure for validation: column types under columns and mandatory columns under required.
  • baselineDataset: Baseline dataset for drift detection.
  • outlierMethod: Algorithm for outlier detection (zscore, iqr, isolation_forest).
  • outlierThreshold: Sensitivity threshold for outliers (default: 3).
  • checkDuplicates, checkBalance: Toggle specific quality checks.
  • targetColumn: Column name for class balance analysis.
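For intuition about outlierMethod and outlierThreshold, the sketch below implements the two simpler methods, Z-score and IQR, with the Python standard library. It is a hedged illustration: the function names are hypothetical, the 1.5 IQR multiplier is the conventional default rather than something the documentation states, and how the actor maps outlierThreshold onto each method is not documented. Isolation Forest is omitted because it needs a third-party library.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical column: values near 10 plus one obvious anomaly at the end.
column = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 11, 10, 10, 12, 9, 10, 11, 100]
print(zscore_outliers(column, threshold=3))  # [19]
print(iqr_outliers(column))                  # [19]
```

A practical difference between the two: a single extreme value inflates the mean and standard deviation, so on small samples the Z-score method can fail to flag it, while the IQR method is less affected.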

Output:
The actor returns a structured JSON object containing quality scores, lists of detected issues (like missing values, duplicates, drift, or outliers), and specific recommendations for improving the dataset.

Actor Information

  • Developer: fiery_dream
  • Pricing: Paid
  • Total Runs: 14
  • Active Users: 2