Apify Smart Dataset Comparator
by agenscrape
Instantly compare 2-10 Apify datasets to find changes, merge data, and track quality. Essential for price monitoring, lead deduplication, and reliable data pipelines.
About Apify Smart Dataset Comparator
Ever needed to see exactly what changed between two datasets? The Apify Smart Dataset Comparator takes 2 to 10 Apify datasets and reports the differences: new records, removed items, and field-level changes. Beyond comparison, it can merge data using a chosen strategy, validate schemas, normalize inconsistent values, and flag anomalies.
In practice, price tracking shows which competitor prices actually moved, lead generation benefits from deduplicated lists so the same person is not contacted twice, and any data pipeline gains a quality checkpoint that catches errors before they reach your reports. The result is a clear view of how your data evolved, without manual spreadsheet work.
What does this actor do?
Apify Smart Dataset Comparator is an actor available on the Apify platform. It runs in the cloud and compares Apify datasets you have already collected.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
Apify Smart Dataset Comparator
Compare 2-10 Apify datasets to detect changes, find new or removed records, deduplicate data, and merge results using custom rules. Useful for price monitoring, lead list deduplication, SEO tracking, and data quality validation.
Facing an issue, unexpected error, edge case, or have a feature suggestion? Post it here and we'll address it within 24 hours.
Overview
This actor analyzes multiple datasets, identifies differences, and provides a structured output of changes. It works by matching records across datasets using a primary key you define.
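The primary-key matching described above can be sketched in a few lines of Python. This is a minimal illustration for two datasets, not the actor's actual implementation; the function names `index_by_key` and `compare` and the shape of the result are assumptions made for the example.

```python
from typing import Any

Record = dict[str, Any]


def index_by_key(records: list[Record], primary_key: str) -> dict[Any, Record]:
    """Map each primary-key value to its record (a later duplicate overwrites an earlier one)."""
    return {r[primary_key]: r for r in records if primary_key in r}


def compare(old: list[Record], new: list[Record], primary_key: str) -> dict[str, list]:
    """Classify records as new, removed, changed, or unchanged (hypothetical result shape)."""
    old_idx = index_by_key(old, primary_key)
    new_idx = index_by_key(new, primary_key)
    result: dict[str, list] = {"new": [], "removed": [], "changed": [], "unchanged": []}

    for key, after in new_idx.items():
        if key not in old_idx:
            result["new"].append(after)
            continue
        before = old_idx[key]
        # Field-level before/after diff over the union of both records' fields.
        diff = {
            field: {"before": before.get(field), "after": after.get(field)}
            for field in sorted(set(before) | set(after))
            if before.get(field) != after.get(field)
        }
        if diff:
            result["changed"].append({"key": key, "diff": diff})
        else:
            result["unchanged"].append(after)

    result["removed"] = [rec for key, rec in old_idx.items() if key not in new_idx]
    return result
```

Calling `compare(old_items, new_items, "url")` on two lists of dataset items returns the four groups keyed by name.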
Key Features
- Change Detection: Compares records by primary key and shows field-level before/after diffs.
- Record Tracking: Identifies records that are new, removed, or unchanged.
- Data Merging: Combines unique records using a chosen merge strategy (e.g., newest dataset wins, average numbers).
- Duplicate Detection: Finds exact duplicates and can use fuzzy matching (Levenshtein distance) for similar records.
- Schema Validation: Checks for field consistency, type conflicts, and missing fields across datasets.
- Smart Presets: Pre-configured settings for common use cases:
  - price_monitoring: For e-commerce; includes price tolerance and ignores timestamps.
  - lead_lists: For CRM; normalizes contact info and enables fuzzy deduplication.
  - seo: For content tracking; uses strict comparison and URL normalization.
  - real_estate: For property listings; applies tight price tolerance and phone normalization.
- Data Cleaning: Automatically normalizes emails, phone numbers, URLs (removes tracking parameters), and currency values before comparison.
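To make the data cleaning step more concrete, here is a rough sketch of what normalizing emails, phone numbers, and URLs before comparison might look like. The actor's real rules are not documented here, so everything below is an assumption: the function names, the list of tracking parameters (`utm_*`, `gclid`, `fbclid`), and the choice to drop URL fragments. Currency handling is left out.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the actor's actual list is not stated in the docs.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def normalize_email(value: str) -> str:
    """Trim whitespace and lowercase the address."""
    return value.strip().lower()


def normalize_phone(value: str) -> str:
    """Keep digits only, dropping spaces, dashes, parentheses, and '+'."""
    return re.sub(r"\D", "", value)


def normalize_url(value: str) -> str:
    """Lowercase the host, drop tracking query parameters and the fragment."""
    parts = urlsplit(value.strip())
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))
```

Normalizing both sides before comparing means that `https://Example.com/p?id=1&utm_source=x` and `https://example.com/p?id=1` are treated as the same value.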
How to Use
Provide the dataset IDs and a primary key field. The actor will process and return the comparison results.
Basic Input Example:
{
"datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
"primaryKey": "url"
}
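One possible way to send this input from Python uses the official `apify-client` package. Treat it as a sketch under assumptions: the actor ID `agenscrape/smart-dataset-comparator` is a guess (copy the real one from the actor's page), and `build_input` and `run_comparison` are helper names invented for this example.

```python
from typing import Any

# Hypothetical actor ID: the documentation above does not state the exact identifier.
ACTOR_ID = "agenscrape/smart-dataset-comparator"


def build_input(dataset_ids: list[str], primary_key: str, **options: Any) -> dict[str, Any]:
    """Build the actor input, enforcing the documented 2-10 dataset limit."""
    if not 2 <= len(dataset_ids) <= 10:
        raise ValueError("datasetIds must contain between 2 and 10 dataset IDs")
    if not primary_key:
        raise ValueError("primaryKey is required")
    return {"datasetIds": list(dataset_ids), "primaryKey": primary_key, **options}


def run_comparison(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the actor, wait for it to finish, and return the items of its default dataset."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `run_comparison(token, build_input(["DATASET_ID_1", "DATASET_ID_2"], "url", preset="price_monitoring"))` would run the comparison with the price monitoring preset, provided the actor ID is correct.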
Merge Strategies (when the same record exists in multiple datasets):
* left_priority: First dataset wins (default).
* right_priority: Last/newest dataset wins.
* most_recent: Record with the newest timestamp wins.
* most_complete: Record with the most filled fields wins.
* combine_arrays: Merge array fields from all records.
* average_numbers: Average numeric fields.
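The sketch below shows how several of these strategies could behave for records that share a primary key. It is an interpretation, not the actor's code: it assumes that left_priority and right_priority take the whole record from one dataset, that "filled" means not `None` and not empty, and it omits most_recent because the timestamp field it relies on is not specified. The `merge` function name is hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def _filled(rec: Record) -> int:
    """Count fields that are neither None nor empty (assumed meaning of 'filled')."""
    return sum(1 for v in rec.values() if v not in (None, "", [], {}))


def merge(records: list[Record], strategy: str = "left_priority") -> Record:
    """Merge same-key records, given in dataset order, into a single record."""
    if not records:
        raise ValueError("need at least one record")
    if strategy == "left_priority":
        return dict(records[0])
    if strategy == "right_priority":
        return dict(records[-1])
    if strategy == "most_complete":
        return dict(max(records, key=_filled))

    merged = dict(records[0])  # start from the first record, then adjust fields
    fields = {f for r in records for f in r}
    if strategy == "combine_arrays":
        for f in fields:
            combined: list[Any] = []
            for r in records:
                if isinstance(r.get(f), list):
                    # Preserve order and skip items already collected.
                    combined.extend(x for x in r[f] if x not in combined)
            if combined:
                merged[f] = combined
        return merged
    if strategy == "average_numbers":
        for f in fields:
            nums = [
                r[f] for r in records
                if isinstance(r.get(f), (int, float)) and not isinstance(r.get(f), bool)
            ]
            if nums:
                merged[f] = sum(nums) / len(nums)
        return merged
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")
```

With two records for the same URL priced at 10 and 20, `merge(records, "average_numbers")` yields a price of 15.0, while `merge(records)` keeps the first record unchanged.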
Input Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| datasetIds | array | 2-10 Apify dataset IDs to compare. |
| primaryKey | string | Field to uniquely identify records (supports product.id dot notation). |
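Dot notation in `primaryKey` presumably means walking nested objects one key at a time. A small sketch of that lookup follows; `get_by_path` is a hypothetical helper, and how the actor treats records where the path is missing is not documented.

```python
from typing import Any


def get_by_path(record: dict[str, Any], path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as 'product.id' against nested dictionaries."""
    current: Any = record
    for part in path.split("."):
        # Stop as soon as the path leaves dictionary territory or a key is missing.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current
```

For a record like `{"product": {"id": 42}}`, the path `product.id` resolves to the nested ID, and a plain key such as `url` works the same way as a one-part path.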
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| preset | string | - | Use a smart preset: price_monitoring, lead_lists, seo, real_estate. |
| ignoreFields | array | [] | Fields to skip during comparison. |
| sensitivity | string | strict | Comparison strictness: strict, medium, relaxed. |
| numericTolerance | number | 0 | Ignore numeric changes below this percentage. |
| detectDuplicates | boolean | false | Find duplicates within individual datasets. |
| fuzzyMatching | boolean | false | Enable fuzzy duplicate detection. |
| fuzzyThreshold | number | 0.85 | Similarity threshold (0-1) for fuzzy matching. |
| validateSchema | boolean | false | Compare field schemas across datasets. |
| mergeStrategy | string | left_priority | Strategy for merging conflicting records. |
| webhookUrl | string | - | URL to send a notification upon completion. |
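To give a feel for how `fuzzyThreshold` might interact with Levenshtein distance, the sketch below converts edit distance into a 0-1 similarity score and compares it with the threshold. The conversion `1 - distance / longer_length` is a common convention and an assumption here; the actor may score similarity differently, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two values as duplicates when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

Under this scoring, "Acme Corporation" and "Acme Corporatoin" differ by two edits over 16 characters (similarity 0.875), so they count as duplicates at the default threshold of 0.85. Lowering the threshold catches more near-matches at the cost of more false positives.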
Output
The actor generates a dataset containing the following views:
- Changes: Records that were modified, with a detailed diff of altered fields.
- New Records: Records present only in the newer dataset(s).
- Removed Records: Records present only in the older dataset(s).
- Merged Data: A consolidated dataset of all unique records, merged according to your chosen strategy.
- Duplicates: A report of exact and fuzzy duplicate records found within the input datasets.
- Schema Analysis: A summary of field types, conflicts, and consistency across datasets.
- Anomalies: Flags significant changes, like price swings >50% or stock depletions.
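The relationship between `numericTolerance` and the anomaly flag can be illustrated with a percentage-change check. This is a sketch under assumptions: the tolerance is read as a percentage value (2 means 2%), the 50% anomaly cutoff comes from the example above, and `classify_change` is a made-up name rather than part of the actor's output.

```python
def percent_change(before: float, after: float) -> float:
    """Absolute change relative to the old value, in percent."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / abs(before) * 100


def classify_change(
    before: float,
    after: float,
    numeric_tolerance: float = 0.0,
    anomaly_threshold: float = 50.0,
) -> str:
    """Label a numeric change as 'unchanged', 'changed', or 'anomaly'."""
    change = percent_change(before, after)
    if change == 0 or change < numeric_tolerance:
        return "unchanged"  # within tolerance, so not reported as a change
    if change > anomaly_threshold:
        return "anomaly"    # e.g. a price swing of more than 50%
    return "changed"
```

With a tolerance of 2, a price moving from 100 to 101 is ignored, a move to 120 is reported as a change, and a drop to 40 is flagged as an anomaly.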
Common Use Cases
- Market Research: Gather competitive intelligence and market data
- Lead Generation: Extract contact information for sales outreach
- Price Monitoring: Track competitor pricing and product changes
- Content Aggregation: Collect and organize content from multiple sources
Actor Information
- Developer: agenscrape
- Pricing: Paid
- Total Runs: 23
- Active Users: 2