Dataset Quality Scorer
by fiery_dream
Score and improve your ML datasets. This tool checks for completeness, consistency, duplicates, balance, data drift, and outliers, then recommends fixes.
About Dataset Quality Scorer
Machine learning models are only as good as the data you feed them. The Dataset Quality Scorer is the actor I run to sanity-check my datasets before they go anywhere near a model. It goes beyond basic stats and scores your data on the dimensions that matter for training: completeness (are there missing values?), consistency (is the formatting a mess?), and balance (is one class dominating?). It hunts down duplicates that can skew your results and flags outliers that might be errors or critical edge cases. It also monitors for data drift, helping you catch when your live data starts to diverge from your training set, which is a classic reason models degrade over time.
After the analysis, it doesn't just leave you with a score. It gives you clear, actionable recommendations on how to fix the issues it finds. I use it as a final checkpoint to clean my data, validate new data sources, and maintain the health of models in production. It turns a vague worry about data quality into a clear report card and a to-do list.
What does this actor do?
Dataset Quality Scorer is a data quality analysis tool available on the Apify platform. It runs in the cloud and scores machine learning datasets for completeness, consistency, duplicates, balance, drift, and outliers.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
Dataset Quality Scorer
An Apify actor that automatically scores and diagnoses the quality of machine learning datasets. It checks for common data issues, detects drift and outliers, and generates actionable reports to improve data health before model training.
Key Features
- Quality Scoring: Calculates a comprehensive score based on completeness, consistency, duplicates, and class balance.
- Data Drift Detection: Compares a current dataset against a baseline to identify significant distribution shifts.
- Outlier Detection: Finds anomalies using multiple methods (Z-score, IQR, Isolation Forest).
- Schema Validation: Ensures data matches an expected structure, checking column presence and data types.
- Full Report Generation: Runs all checks and combines results into a single, detailed analysis.
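The scoring dimensions listed above can be illustrated locally. The following is a minimal sketch, not the actor's implementation: the function names, the definition of "missing" (None or an empty string), and the balance measure (rarest class count divided by most common class count) are all assumptions chosen for illustration, and the actor's real weighting is not documented here.

```python
from collections import Counter

MISSING = (None, "")  # assumption: what counts as a missing value


def completeness(rows, columns):
    """Fraction of cells that hold a non-missing value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for col in columns if row.get(col) not in MISSING)
    return filled / total


def duplicate_ratio(rows, columns):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    return sum(count - 1 for count in seen.values()) / len(rows)


def balance(rows, target_column):
    """Rarest class count divided by most common class count (1.0 = balanced)."""
    counts = Counter(
        row[target_column] for row in rows if row.get(target_column) not in MISSING
    )
    if not counts:
        return 0.0
    return min(counts.values()) / max(counts.values())


# Hypothetical sample data, mirroring the column names used in the example input.
sample_columns = ["id", "feature1", "label"]
sample_rows = [
    {"id": 1, "feature1": 0.5, "label": "a"},
    {"id": 2, "feature1": None, "label": "a"},
    {"id": 2, "feature1": None, "label": "a"},  # exact duplicate of the row above
    {"id": 3, "feature1": 1.5, "label": "b"},
]

print(completeness(sample_rows, sample_columns))
print(duplicate_ratio(sample_rows, sample_columns))
print(balance(sample_rows, "label"))
```

Each function returns a value between 0 and 1, which makes the three easy to combine into a single score if you choose your own weights.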
How to Use
Choose an operation and provide your dataset via datasetUrl (a URL or Apify dataset ID) or inline as datasetData. Configure the specific checks you need.
Common Operations & Parameters
- scoreQuality: Scores the dataset. Use the checkDuplicates and checkBalance flags; specify targetColumn for balance analysis.
- detectDrift: Compares against a baselineDataset to find distribution changes.
- findOutliers: Detects anomalies. Choose an outlierMethod (zscore, iqr, isolation_forest) and set the outlierThreshold.
- validateSchema: Validates data structure against a provided schemaDefinition.
- generateReport: Runs a full suite analysis (recommended).
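To make the zscore and iqr options more concrete, here is a small sketch of both techniques using only the Python standard library. It shows the textbook versions and is not taken from the actor's code: the function names are hypothetical, and how the actor maps outlierThreshold onto the IQR method is not documented, so the conventional multiplier of 1.5 is assumed.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical feature column with one obviously extreme value.
sample_values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 100]

print(zscore_outliers(sample_values))  # flags 100
print(iqr_outliers(sample_values))     # flags 100
```

The z-score method is sensitive to the outliers themselves, since they inflate the standard deviation, while the IQR method relies on quartiles and is more robust on small or skewed samples.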
Example Input
To run a full report with schema validation and outlier detection:
{
  "operation": "generateReport",
  "datasetUrl": "https://example.com/data.csv",
  "schemaDefinition": {
    "columns": {
      "id": "number",
      "feature1": "number",
      "label": "string"
    },
    "required": ["id", "label"]
  },
  "checkDuplicates": true,
  "checkBalance": true,
  "targetColumn": "label",
  "outlierMethod": "zscore",
  "outlierThreshold": 3
}
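An input like the one above could be submitted programmatically with the apify-client Python package. The sketch below rests on several assumptions: the helper names are hypothetical, the actor ID is deliberately left as a parameter to be copied from the actor's page on Apify, and it assumes the results are written to the run's default dataset, which this page does not confirm.

```python
def build_report_input(dataset_url, target_column, schema_definition=None):
    """Assemble a generateReport input using the parameter names documented here."""
    run_input = {
        "operation": "generateReport",
        "datasetUrl": dataset_url,
        "checkDuplicates": True,
        "checkBalance": True,
        "targetColumn": target_column,
        "outlierMethod": "zscore",
        "outlierThreshold": 3,
    }
    if schema_definition is not None:
        run_input["schemaDefinition"] = schema_definition
    return run_input


def run_report(token, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items.

    Assumption: the report is stored in the run's default dataset.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


example_input = build_report_input(
    "https://example.com/data.csv",
    "label",
    schema_definition={
        "columns": {"id": "number", "feature1": "number", "label": "string"},
        "required": ["id", "label"],
    },
)
print(example_input["operation"])
```

Calling run_report requires a valid Apify API token and the actor's real ID; build_report_input works offline and can be used to check an input before spending a run.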
Input/Output
Main Input Parameters:
* operation (required): The analysis to run (scoreQuality, detectDrift, findOutliers, validateSchema, generateReport).
* datasetUrl / datasetData: The dataset to analyze.
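* schemaDefinition: Expected structure for validation: a columns object mapping each column name to its type, plus a required list of column names (see the example input).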
* baselineDataset: Baseline dataset for drift detection.
* outlierMethod: Algorithm for outlier detection.
* outlierThreshold: Sensitivity threshold for outliers (default: 3).
* checkDuplicates, checkBalance: Toggle specific quality checks.
* targetColumn: Column name for class balance analysis.
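The baselineDataset parameter implies a comparison between two distributions. This page does not say which statistic the actor uses, so the sketch below picks one common choice, the Population Stability Index (PSI), purely as an illustration. The function name, the equal-width binning over the baseline range, and the small floor used to avoid taking the logarithm of zero are all assumptions.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of one numeric column against a baseline."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: the current sample is the baseline shifted upward by 50.
baseline_values = list(range(100))
shifted_values = [v + 50 for v in baseline_values]

print(psi(baseline_values, baseline_values))
print(psi(baseline_values, shifted_values))
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant one, though suitable cutoffs depend on the data and the number of bins.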
Output:
The actor returns a structured JSON object containing quality scores, lists of detected issues (like missing values, duplicates, drift, or outliers), and specific recommendations for improving the dataset.
Actor Information
- Developer: fiery_dream
- Pricing: Paid
- Total Runs: 14
- Active Users: 2