Apify Smart Dataset Comparator

by agenscrape

Instantly compare 2-10 Apify datasets to find changes, merge data, and track quality. Essential for price monitoring, lead deduplication, and reliable data pipelines.

23 runs
2 users

About Apify Smart Dataset Comparator

Ever needed to see exactly what changed between two datasets? The Apify Smart Dataset Comparator takes 2 to 10 Apify datasets and reports the differences between them: new records, removed items, and field-level changes. Beyond plain comparison, it can merge data using a chosen strategy, validate schemas, normalize inconsistent values, and flag anomalies such as unusually large price swings.

In practice, that covers several common jobs. If you're tracking prices across competitors, it shows which numbers actually moved. If you're running lead generation, it deduplicates lists so you don't contact the same person twice. It also acts as a quality checkpoint for a data pipeline, catching errors before they reach your reports, so you can see how your data evolved without digging through spreadsheets by hand.

What does this actor do?

Apify Smart Dataset Comparator is a data comparison and automation tool available on the Apify platform. It runs in the cloud and works on datasets you already have, helping you detect changes, merge records, and check data quality.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Open the actor's page on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Compare 2-10 Apify datasets to detect changes, find new or removed records, deduplicate data, and merge results using custom rules. Useful for price monitoring, lead list deduplication, SEO tracking, and data quality validation.

If you run into an issue, unexpected error, or edge case, or have a feature suggestion, post it here and we'll address it within 24 hours.

Overview

This actor analyzes multiple datasets, identifies differences, and provides a structured output of changes. It works by matching records across datasets using a primary key you define.
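As a rough illustration of primary-key matching, the sketch below diffs two lists of records in plain Python. It is not the actor's implementation: the function name `diff_datasets` and the shape of the returned dictionary are assumptions made for this example.

```python
from typing import Any


def diff_datasets(old: list[dict], new: list[dict], primary_key: str) -> dict[str, Any]:
    """Match records by primary key and classify them (illustrative sketch only)."""
    old_by_key = {record[primary_key]: record for record in old}
    new_by_key = {record[primary_key]: record for record in new}

    result = {"new": [], "removed": [], "changed": [], "unchanged": []}

    # Records whose key appears only in the newer dataset.
    for key, record in new_by_key.items():
        if key not in old_by_key:
            result["new"].append(record)

    for key, before in old_by_key.items():
        if key not in new_by_key:
            # Key appears only in the older dataset.
            result["removed"].append(before)
            continue
        after = new_by_key[key]
        # Field-level before/after diff over the union of both records' fields.
        field_diff = {
            field: {"before": before.get(field), "after": after.get(field)}
            for field in sorted(set(before) | set(after))
            if before.get(field) != after.get(field)
        }
        if field_diff:
            result["changed"].append({"key": key, "diff": field_diff})
        else:
            result["unchanged"].append(after)
    return result


old = [{"url": "a", "price": 10}, {"url": "b", "price": 5}]
new = [{"url": "a", "price": 12}, {"url": "c", "price": 7}]
report = diff_datasets(old, new, "url")
```

Here `report["changed"]` holds one entry for key `"a"` with the price going from 10 to 12, while `"b"` is reported as removed and `"c"` as new.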

Key Features

  • Change Detection: Compares records by primary key and shows field-level before/after diffs.
  • Record Tracking: Identifies records that are new, removed, or unchanged.
  • Data Merging: Combines unique records using a chosen merge strategy (e.g., newest dataset wins, average numbers).
  • Duplicate Detection: Finds exact duplicates and can use fuzzy matching (Levenshtein distance) for similar records.
  • Schema Validation: Checks for field consistency, type conflicts, and missing fields across datasets.
  • Smart Presets: Pre-configured settings for common use cases:
    • price_monitoring: For e-commerce; includes price tolerance and ignores timestamps.
    • lead_lists: For CRM; normalizes contact info and enables fuzzy deduplication.
    • seo: For content tracking; uses strict comparison and URL normalization.
    • real_estate: For property listings; applies tight price tolerance and phone normalization.
  • Data Cleaning: Automatically normalizes emails, phone numbers, URLs (removes tracking parameters), and currency values before comparison.
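To make the fuzzy duplicate detection more concrete, here is a minimal sketch of a Levenshtein-based similarity score. How the actor turns an edit distance into a 0-1 score is not documented here, so the normalization used below (one minus distance divided by the longer string's length) and the helper names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Compare case-insensitively against a 0-1 similarity threshold."""
    return similarity(a.lower(), b.lower()) >= threshold
```

Under this assumption, "Acme Corporation" and "Acme Corporatoin" differ by two substitutions over 16 characters, giving a similarity of 0.875, which passes a threshold of 0.85.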

How to Use

Provide the dataset IDs and a primary key field. The actor will process and return the comparison results.

Basic Input Example:

{
  "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
  "primaryKey": "url"
}
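The same input can be built and submitted from Python. The sketch below assumes the `apify-client` package is installed and that you supply your own API token and the actor's ID; the actor ID is not given in this document, so it is left as a parameter. The helper names `build_run_input` and `run_comparator` are made up for this example, and the structure of the returned items is not assumed.

```python
def build_run_input(dataset_ids, primary_key, **options):
    """Build the actor input, enforcing the documented 2-10 dataset limit."""
    if not 2 <= len(dataset_ids) <= 10:
        raise ValueError("datasetIds must contain between 2 and 10 dataset IDs")
    if not primary_key:
        raise ValueError("primaryKey is required")
    # Optional parameters (preset, mergeStrategy, ...) are passed through as-is.
    return {"datasetIds": list(dataset_ids), "primaryKey": primary_key, **options}


def run_comparator(token, actor_id, run_input):
    """Start the actor and return the items of its default dataset.

    actor_id is a placeholder you must supply; it is not stated in this document.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(
    ["DATASET_ID_1", "DATASET_ID_2"], "url", preset="price_monitoring"
)
```

Validating the input locally before starting a run avoids spending a paid run on a request that breaks the documented limits.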

Merge Strategies (when the same record exists in multiple datasets):
  • left_priority: First dataset wins (default).
  • right_priority: Last/newest dataset wins.
  • most_recent: Record with the newest timestamp wins.
  • most_complete: Record with the most filled fields wins.
  • combine_arrays: Merge array fields from all records.
  • average_numbers: Average numeric fields.
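The sketch below shows one plausible reading of these strategies for a list of records that share the same primary key, ordered from first to last dataset. The exact semantics inside the actor may differ. `most_recent` is omitted because the timestamp field it relies on is not named here, and `merge_records` is a hypothetical helper.

```python
from numbers import Number


def _is_number(value):
    return isinstance(value, Number) and not isinstance(value, bool)


def _filled_count(record):
    """Count fields that hold a non-empty value."""
    return sum(1 for value in record.values() if value not in (None, "", [], {}))


def merge_records(records, strategy="left_priority"):
    """Merge records with the same key; records[0] comes from the first dataset."""
    if strategy == "left_priority":
        return dict(records[0])
    if strategy == "right_priority":
        return dict(records[-1])
    if strategy == "most_complete":
        return dict(max(records, key=_filled_count))  # ties go to the earliest record

    merged = dict(records[0])
    if strategy == "combine_arrays":
        for record in records[1:]:
            for field, value in record.items():
                if isinstance(value, list) and isinstance(merged.get(field), list):
                    combined = list(merged[field])
                    combined.extend(item for item in value if item not in combined)
                    merged[field] = combined
                else:
                    merged.setdefault(field, value)  # non-array fields: first value wins
        return merged
    if strategy == "average_numbers":
        for field in merged:
            if _is_number(merged[field]):
                values = [r[field] for r in records if _is_number(r.get(field))]
                merged[field] = sum(values) / len(values)
        return merged
    raise ValueError(f"Unsupported strategy in this sketch: {strategy}")


a = {"url": "x", "price": 10, "tags": ["a"], "name": ""}
b = {"url": "x", "price": 20, "tags": ["a", "b"], "name": "Widget"}
```

With these two records, `most_complete` picks `b` because its `name` field is filled, `combine_arrays` yields tags `["a", "b"]`, and `average_numbers` sets the price to 15.0.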

Input Parameters

Required

Parameter  | Type   | Description
datasetIds | array  | 2-10 Apify dataset IDs to compare.
primaryKey | string | Field to uniquely identify records (supports dot notation, e.g. product.id).
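Dot notation means the primary key can live inside a nested object. A minimal sketch of how such a path might be resolved, assuming nested dictionaries only (array indexing is not covered), with `get_by_path` as a hypothetical helper name:

```python
def get_by_path(record, path, default=None):
    """Resolve a dotted path such as "product.id" inside nested dictionaries."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default  # the path does not exist in this record
        current = current[part]
    return current


item = {"product": {"id": 42, "name": "Widget"}, "url": "https://example.com/p/42"}
key = get_by_path(item, "product.id")
```

Records where the path resolves to nothing (here returned as `None`) cannot be matched by key, so it is worth checking for them before comparing.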

Optional

Parameter        | Type    | Default       | Description
preset           | string  | -             | Use a smart preset: price_monitoring, lead_lists, seo, real_estate.
ignoreFields     | array   | []            | Fields to skip during comparison.
sensitivity      | string  | strict        | Comparison strictness: strict, medium, relaxed.
numericTolerance | number  | 0             | Ignore numeric changes below this percentage.
detectDuplicates | boolean | false         | Find duplicates within individual datasets.
fuzzyMatching    | boolean | false         | Enable fuzzy duplicate detection.
fuzzyThreshold   | number  | 0.85          | Similarity threshold (0-1) for fuzzy matching.
validateSchema   | boolean | false         | Compare field schemas across datasets.
mergeStrategy    | string  | left_priority | Strategy for merging conflicting records.
webhookUrl       | string  | -             | URL to send a notification upon completion.
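To illustrate what numericTolerance is meant to filter out, the sketch below treats the tolerance as percentage points (5 means 5%) measured relative to the old value. Both of those choices are assumptions, as are the function names.

```python
def percent_change(before, after):
    """Absolute change as a percentage of the old value."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / abs(before) * 100


def is_significant(before, after, tolerance_percent=0):
    """True when the value changed by at least the tolerance; equal values never count."""
    return before != after and percent_change(before, after) >= tolerance_percent


# With a 2% tolerance, a move from 100 to 101 is ignored and 100 to 105 is reported.
small_move = is_significant(100, 101, tolerance_percent=2)
large_move = is_significant(100, 105, tolerance_percent=2)
```

A small tolerance like this is useful for price data, where rounding or currency conversion can nudge values without any real change.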

Output

The actor generates a dataset containing the following views:

  • Changes: Records that were modified, with a detailed diff of altered fields.
  • New Records: Records present only in the newer dataset(s).
  • Removed Records: Records present only in the older dataset(s).
  • Merged Data: A consolidated dataset of all unique records, merged according to your chosen strategy.
  • Duplicates: A report of exact and fuzzy duplicate records found within the input datasets.
  • Schema Analysis: A summary of field types, conflicts, and consistency across datasets.
  • Anomalies: Flags significant changes, like price swings >50% or stock depletions.
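As an example of the kind of check behind the Anomalies view, the sketch below flags price swings larger than 50% between two snapshots. The input shape (dictionaries mapping a key to a price) and the output fields are assumptions for illustration, not the actor's actual format.

```python
def find_price_swings(old_prices, new_prices, threshold_percent=50.0):
    """Flag keys whose price moved by more than threshold_percent in either direction."""
    flagged = []
    for key, before in old_prices.items():
        after = new_prices.get(key)
        if after is None or not before:
            continue  # skip removed records and zero or missing old prices
        change = (after - before) / before * 100
        if abs(change) > threshold_percent:
            flagged.append({
                "key": key,
                "before": before,
                "after": after,
                "change_percent": round(change, 1),
            })
    return flagged


old_prices = {"a": 100, "b": 20, "c": 10}
new_prices = {"a": 40, "b": 22, "c": 10}
swings = find_price_swings(old_prices, new_prices)
```

In this example only `"a"` is flagged, with a drop of 60%; the 10% move on `"b"` stays below the threshold.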

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Deduplicate contact lists before sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Merge and organize content collected from multiple sources

Actor Information

Developer: agenscrape
Pricing: Paid
Total Runs: 23
Active Users: 2

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
