Dataset Quality Scorer

Author: fiery_dream

Score and improve your ML datasets. This tool checks for completeness, consistency, duplicates, balance, data drift, and outliers, then recommends fixes.

14 runs

2 users


About Dataset Quality Scorer

Ever feel like your machine learning models are only as good as the data you feed them? You're right. The Dataset Quality Scorer is the actor I run to sanity-check my datasets before they go anywhere near a model. It doesn't just run basic stats; it digs into what actually matters for training.

It scores your data on critical dimensions like completeness (are there missing values?), consistency (is the formatting a mess?), and balance (is one class dominating?). It hunts down duplicates that can skew your results and flags outliers that might be errors or critical edge cases. It also checks for data drift, helping you catch when your live data starts to diverge from your training set, which is a classic reason models degrade over time.

After the analysis, it doesn't just leave you with a score. It gives you clear, actionable recommendations on how to fix the issues it finds. I use it as a final checkpoint to clean my data, validate new data sources, and maintain the health of models in production. It turns a vague worry about data quality into a clear report card and a to-do list.

What does this actor do?

Dataset Quality Scorer is a data quality tool for machine learning datasets, available as an actor on the Apify platform. It runs in the cloud and scores, validates, and diagnoses the data you give it.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Open Dataset Quality Scorer on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

Dataset Quality Scorer

An Apify actor that automatically scores and diagnoses the quality of machine learning datasets. It checks for common data issues, detects drift and outliers, and generates actionable reports to improve data health before model training.

Key Features

Quality Scoring: Calculates a comprehensive score based on completeness, consistency, duplicates, and class balance.
Data Drift Detection: Compares a current dataset against a baseline to identify significant distribution shifts.
Outlier Detection: Finds anomalies using multiple methods (Z-score, IQR, Isolation Forest).
Schema Validation: Ensures data matches an expected structure, checking column presence and data types.
Full Report Generation: Runs all checks and combines results into a single, detailed analysis.
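
The page names the outlier methods but not how they are implemented. As a rough illustration of the two simpler ones, here is a standard-library sketch of Z-score and IQR detection. The function names, the use of the population standard deviation, and the 1.5 IQR multiplier are assumptions made for this example, not the actor's confirmed behavior. Isolation Forest is left out because it needs a third-party library.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumption)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical column: nineteen ordinary readings and one extreme value.
readings = [10] * 19 + [100]
print(zscore_outliers(readings))  # [19]
print(iqr_outliers(readings))     # [19]
```

One thing to keep in mind with small datasets: when the population standard deviation is used, no point in a sample of n values can have a z-score above sqrt(n - 1), so a threshold of 3 can never be exceeded with ten rows or fewer. The IQR method does not have that limitation.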

How to Use

Choose an operation and provide your dataset via datasetUrl (a URL or Apify dataset ID) or inline as datasetData. Configure the specific checks you need.
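
To make the input rules above concrete, here is a small sketch that assembles a run input and rejects obviously incomplete ones. The helper name build_run_input is hypothetical, and two of its checks are assumptions: that detectDrift needs baselineDataset and that validateSchema needs schemaDefinition. The page describes those parameters but does not say they are mandatory.

```python
# Operation names as listed in the documentation.
OPERATIONS = {
    "scoreQuality",
    "detectDrift",
    "findOutliers",
    "validateSchema",
    "generateReport",
}

# Assumption: these operations cannot run without the named parameter.
REQUIRED_EXTRAS = {
    "detectDrift": "baselineDataset",
    "validateSchema": "schemaDefinition",
}


def build_run_input(operation, dataset_url=None, dataset_data=None, **options):
    """Build the actor input dict, checking the documented basics first."""
    if operation not in OPERATIONS:
        raise ValueError(f"Unknown operation: {operation!r}")
    if dataset_url is None and dataset_data is None:
        raise ValueError("Provide datasetUrl or datasetData")

    run_input = {"operation": operation}
    if dataset_url is not None:
        run_input["datasetUrl"] = dataset_url
    if dataset_data is not None:
        run_input["datasetData"] = dataset_data
    run_input.update(options)  # e.g. outlierMethod, checkDuplicates

    extra = REQUIRED_EXTRAS.get(operation)
    if extra is not None and extra not in run_input:
        raise ValueError(f"{operation} needs {extra}")
    return run_input


example = build_run_input(
    "findOutliers",
    dataset_url="https://example.com/data.csv",
    outlierMethod="iqr",
    outlierThreshold=3,
)
print(example)
```

The resulting dict is what you would paste into the actor's input form or send as the run input through the Apify API.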

Common Operations & Parameters

scoreQuality: Scores the dataset. Use checkDuplicates and checkBalance flags; specify targetColumn for balance analysis.
detectDrift: Compares against a baselineDataset to find distribution changes.
findOutliers: Detects anomalies. Choose an outlierMethod (zscore, iqr, isolation_forest) and set the outlierThreshold.
validateSchema: Validates data structure against a provided schemaDefinition.
generateReport: Runs a full suite analysis (recommended).
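
The page does not say which statistic detectDrift uses to compare the current data with the baseline. To show what "distribution change" can mean for a numeric column, here is a sketch based on the population stability index (PSI), a common choice that I am swapping in for illustration. The function name, the bin count, and the smoothing constant are all assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """Compare two numeric samples using bins derived from the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Replace empty bins with eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    p = bin_shares(baseline)
    q = bin_shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature: training values 0..99, live values shifted up by 50.
training = list(range(100))
live = [v + 50 for v in training]
print(population_stability_index(training, training))  # 0.0
print(population_stability_index(training, live) > 0.25)  # True
```

A common rule of thumb treats a PSI above 0.25 as a significant shift, but the right cutoff depends on the feature and on how many rows you have.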

Example Input

To run a full report with schema validation and outlier detection:

{
  "operation": "generateReport",
  "datasetUrl": "https://example.com/data.csv",
  "schemaDefinition": {
    "columns": {
      "id": "number",
      "feature1": "number",
      "label": "string"
    },
    "required": ["id", "label"]
  },
  "checkDuplicates": true,
  "checkBalance": true,
  "targetColumn": "label",
  "outlierMethod": "zscore",
  "outlierThreshold": 3
}
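
To show how a schemaDefinition shaped like the one above could be applied to rows of data, here is a minimal sketch. It assumes that "number" means an int or float, that "string" means text, and that a required column must be present and non-null in every row. The actor's real rules are not documented on this page, and validate_schema is a hypothetical name.

```python
def validate_schema(rows, schema):
    """Return a list of human-readable issues found in rows (list of dicts)."""
    type_checks = {
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "string": lambda v: isinstance(v, str),
    }
    columns = schema.get("columns", {})
    required = schema.get("required", [])
    issues = []

    for i, row in enumerate(rows):
        for name in required:
            if row.get(name) is None:
                issues.append(f"row {i}: missing required column '{name}'")
        for name, expected in columns.items():
            value = row.get(name)
            if value is None:
                continue  # absence is handled by the required check above
            check = type_checks.get(expected)
            if check is not None and not check(value):
                issues.append(
                    f"row {i}: column '{name}' expected {expected}, "
                    f"got {type(value).__name__}"
                )
    return issues


schema = {
    "columns": {"id": "number", "feature1": "number", "label": "string"},
    "required": ["id", "label"],
}
rows = [
    {"id": 1, "feature1": 0.5, "label": "cat"},
    {"id": "2", "feature1": 1.2},  # id has the wrong type, label is missing
]
for issue in validate_schema(rows, schema):
    print(issue)
```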

Input/Output

Main Input Parameters:
* operation (required): The analysis to run (scoreQuality, detectDrift, findOutliers, validateSchema, generateReport).
* datasetUrl / datasetData: The dataset to analyze.
* schemaDefinition: Expected structure for validation, given as a JSON object that maps column names to types and lists the required columns.
* baselineDataset: Baseline dataset for drift detection.
* outlierMethod: Algorithm for outlier detection.
* outlierThreshold: Sensitivity threshold for outliers (default: 3).
* checkDuplicates, checkBalance: Toggle specific quality checks.
* targetColumn: Column name for class balance analysis.
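
The page does not give the scoring formula, so the following is only a sketch of how the checkDuplicates, checkBalance, and targetColumn parameters could feed into a score. The three sub-scores, their definitions, and the plain average used for the overall value are assumptions chosen for illustration. Cell values are assumed to be hashable (numbers, strings, None).

```python
from collections import Counter


def score_quality(rows, target_column=None, check_duplicates=True, check_balance=True):
    """Score a list of dict rows on completeness, uniqueness, and class balance."""
    columns = sorted({key for row in rows for key in row})
    if not rows or not columns:
        raise ValueError("Dataset is empty")

    # Completeness: share of cells that are neither missing nor empty strings.
    filled = sum(1 for row in rows for c in columns if row.get(c) not in (None, ""))
    scores = {"completeness": filled / (len(rows) * len(columns))}

    if check_duplicates:
        # Uniqueness: share of rows left after dropping exact duplicates.
        unique_rows = {tuple(row.get(c) for c in columns) for row in rows}
        scores["uniqueness"] = len(unique_rows) / len(rows)

    if check_balance and target_column:
        # Balance: smallest class count divided by largest class count.
        counts = Counter(
            row[target_column] for row in rows if row.get(target_column) is not None
        )
        if counts:
            scores["balance"] = min(counts.values()) / max(counts.values())

    scores["overall"] = sum(scores.values()) / len(scores)
    return scores


sample = [
    {"x": 1, "label": "a"},
    {"x": 1, "label": "a"},  # exact duplicate of the first row
    {"x": None, "label": "b"},  # missing feature value
    {"x": 3, "label": "a"},
]
print(score_quality(sample, target_column="label"))
```

Every sub-score lies between 0 and 1, so a low overall value can be traced back to the dimension that caused it.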

Output:
The actor returns a structured JSON object containing quality scores, lists of detected issues (like missing values, duplicates, drift, or outliers), and specific recommendations for improving the dataset.

Common Use Cases

Pre-Training Checks

Clean and sanity-check a dataset before it goes anywhere near a model

New Data Source Validation

Check that data from a new source matches the expected schema and quality

Production Monitoring

Compare live data against the training baseline to catch drift

Ready to Get Started?

Try Dataset Quality Scorer now on Apify. Free tier available with no credit card required.


Actor Information

Developer: fiery_dream
Pricing: Paid
Total Runs: 14
Active Users: 2


Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

