Advanced Product Matcher Pro

by datawhisperers

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring.

26 runs
2 users

About Advanced Product Matcher Pro

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

What does this actor do?

Advanced Product Matcher Pro is an automation tool (an Actor) that runs on the Apify platform. It matches and reconciles product data between two datasets in the cloud, so you can run large matching jobs without any local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

# AI Product Matcher Actor

A powerful Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

## Features

- **Multi-format Support**: Works with both CSV files (KeyValueStore) and JSON datasets
- **Flexible Data Sources**: Load data directly from Apify Datasets or the KeyValueStore
- **Intelligent Matching**: Uses Sentence Transformers and cosine similarity for semantic product matching
- **Configurable Attributes**: Weight different product attributes based on importance
- **Text Preprocessing**: Built-in word removal, replacement, regex cleaning, and normalization
- **Performance Optimization**: Group products by categories or other attributes for faster processing
- **Multilingual Support**: English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
- **Flexible Output**: Customizable match results with similarity scores, original values, and additional output fields
- **Error Reporting**: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors

## Quick Start

### Basic Configuration Example

```json
{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "catalog_products",
  "dataset1Name": "Catalog",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_products",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.7,
  "maxMatches": 2,
  "language": "en",
  "groupByAttribute": "category",
  "csvSeparator": ",",
  "includeOriginalValues": true,
  "attributes": [
    { "name": "title", "weight": 1.0, "useForMatching": true },
    { "name": "brand", "weight": 0.8, "useForMatching": true },
    { "name": "price", "weight": 0.3, "useForMatching": false }
  ]
}
```

### Core Input Parameters

| Parameter | Type | Description | Default |
| :-- | :-- | :-- | :-- |
| `dataFormat` | string | Data format: `"csv"` or `"json"` | `"json"` |
| `dataSource` | string | Source type: `"datasets"` or `"keyvaluestore"` | `"datasets"` |
| `keyValuestoreNameOrId` | string | Name or ID of the KeyValueStore (when `dataSource` is `keyvaluestore`) | none |
| `dataset1` | string | First dataset key/ID (CSV filename or Dataset ID) | required |
| `dataset1Name` | string | Friendly name for dataset 1 | `"Dataset1"` |
| `dataset1PrimaryKey` | string | Primary key field name in dataset 1 | `"ProductId"` |
| `dataset2` | string | Second dataset key/ID | required |
| `dataset2Name` | string | Friendly name for dataset 2 | `"Dataset2"` |
| `dataset2PrimaryKey` | string | Primary key field name in dataset 2 | `"ProductId"` |
| `threshold` | number | Minimum overall similarity score for matches (0.0–1.0) | 0.5 |
| `maxMatches` | integer | Maximum number of matches returned per item | 2 |
| `language` | string | Embedding model selection: `"en"`, `"multilingual"`, `"es"`, `"fr"`, `"de"`, `"it"`, `"pt"`, `"nl"` | `"en"` |
| `groupByAttribute` | string | Attribute name to group by for efficient matching (optional) | none |
| `csvSeparator` | string | CSV delimiter (only when `dataFormat` is `csv`) | `","` |
| `includeOriginalValues` | boolean | Include original attribute values in the output records | true |
| `dataset1OutputFields` | array | Specific attribute values from dataset 1 to include in the output records | `["Field1"]` |
| `dataset2OutputFields` | array | Specific attribute values from dataset 2 to include in the output records | `["Field1", "Field2"]` |
| `attributes` | array | **Required.** List of attribute configurations (see below) | required |

### Attribute Configuration

Each attribute in `attributes` supports:

- `name` (string, required): Column name (CSV) or attribute key (JSON)
- `weight` (number): Importance weight for matching (higher = more important)
- `useForMatching` (boolean): Whether to include in the similarity calculation
- `jsonPath` (string): JSON path expression for nested data
- `wordsToRemove` (array): List of words to strip before matching
- `wordReplacements` (object): Mapping of terms to replace prior to matching
- `regex` (string): Regex to apply during preprocessing
- `normalizationRegex` (string): Regex applied before the similarity calculation
- `normalizationReplacement` (string): Replacement for the normalization regex

#### Text Preprocessing Example

```json
{
  "name": "brand",
  "weight": 0.8,
  "useForMatching": true,
  "wordsToRemove": ["inc", "llc", "ltd", "corp"],
  "wordReplacements": {
    "apple": "apple inc",
    "samsung": "samsung electronics"
  },
  "regex": "\\b(inc|llc|ltd|corp)\\b",
  "normalizationRegex": "[^a-zA-Z0-9\\s]",
  "normalizationReplacement": ""
}
```

| Property | Type | Description |
|----------|------|-------------|
| `wordsToRemove` | array | Words to remove from text |
| `wordReplacements` | object | Word substitution mapping |
| `regex` | string | Regex pattern for text cleaning |
| `normalizationRegex` | string | Regex for similarity-calculation normalization |
| `normalizationReplacement` | string | Replacement for the normalization regex |
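The preprocessing rules above (word replacement, word removal, and regex cleaning) compose into a simple text pipeline. The `preprocess` helper below is a hypothetical Python sketch of how such rules could combine, not the Actor's actual implementation:

```python
import re


def preprocess(text, words_to_remove=(), word_replacements=None, regex=None):
    """Hypothetical sketch of a wordsToRemove/wordReplacements/regex pipeline."""
    result = text.lower()
    # Apply word replacements first, e.g. {"samsung": "samsung electronics"}
    for old, new in (word_replacements or {}).items():
        result = re.sub(rf"\b{re.escape(old)}\b", new, result)
    # Strip noise words such as legal suffixes ("inc", "llc", ...)
    for word in words_to_remove:
        result = re.sub(rf"\b{re.escape(word)}\b", "", result)
    # Optional cleaning regex applied last
    if regex:
        result = re.sub(regex, "", result)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", result).strip()


print(preprocess("Samsung Galaxy S21",
                 word_replacements={"samsung": "samsung electronics"}))
# samsung electronics galaxy s21
print(preprocess("Apple Inc.", words_to_remove=["inc"], regex="[^a-z0-9 ]"))
# apple
```

The sketch lowercases input and applies replacements before removals; whether the Actor uses the same order is an assumption.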
## Real-World Examples

### 1. E-commerce Catalog Matching

```json
{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "manufacturer_catalog.csv",
  "dataset1Name": "Manufacturer",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_inventory.csv",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.75,
  "maxMatches": 3,
  "language": "en",
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "product_name",
      "weight": 1.5,
      "useForMatching": true,
      "wordsToRemove": ["new", "original", "authentic"],
      "wordReplacements": { "&": "and", "w/": "with" }
    },
    {
      "name": "brand",
      "weight": 1.2,
      "useForMatching": true,
      "wordsToRemove": ["inc", "llc", "corp"],
      "wordReplacements": { "apple": "apple inc", "hp": "hewlett packard" }
    },
    {
      "name": "model_number",
      "weight": 1.8,
      "useForMatching": true,
      "normalizationRegex": "[^A-Za-z0-9]",
      "normalizationReplacement": ""
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false,
      "regex": "\\D"
    }
  ]
}
```

### 2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

```json
{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "fashion_supplier_a",
  "dataset1Name": "SupplierA",
  "dataset1PrimaryKey": "ID",
  "dataset2": "fashion_supplier_b",
  "dataset2Name": "SupplierB",
  "dataset2PrimaryKey": "ID",
  "threshold": 0.65,
  "language": "multilingual",
  "maxMatches": 2,
  "attributes": [
    {
      "name": "Color",
      "jsonPath": "ProductAttributes[Type=Color].Value",
      "weight": 1.5,
      "useForMatching": true,
      "wordReplacements": { "gray": "grey", "navy": "navy blue" }
    },
    {
      "name": "Size",
      "jsonPath": "ProductAttributes[Type=Size].Value",
      "weight": 1.8,
      "useForMatching": true,
      "wordsToRemove": ["size", "us", "eu"],
      "normalizationRegex": "[^0-9XLS]",
      "normalizationReplacement": ""
    },
    {
      "name": "Material",
      "jsonPath": "Details.Fabric.Primary",
      "weight": 1.2,
      "useForMatching": true
    }
  ],
  "includeOriginalValues": false
}
```

### 3. Home & Garden Products

```json
{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "bedbath",
  "dataset1Name": "BedBath",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "overstock",
  "dataset2Name": "Overstock",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.6,
  "language": "en",
  "groupByAttribute": "Model",
  "maxMatches": 3,
  "attributes": [
    {
      "name": "Model",
      "jsonPath": "AdhocDataAttributes[Name=Model].value",
      "weight": 1,
      "useForMatching": false
    },
    {
      "name": "Color",
      "jsonPath": "AdhocDataAttributes[Name=Color].value",
      "weight": 2,
      "useForMatching": true,
      "wordReplacements": { "gray": "grey", "/": " " }
    },
    {
      "name": "Size",
      "jsonPath": "AdhocDataAttributes[Name=Size].value",
      "weight": 3,
      "useForMatching": true,
      "regex": "\\D"
    },
    {
      "name": "Shape",
      "jsonPath": "AdhocDataAttributes[Name=Shape].value",
      "weight": 1,
      "useForMatching": true
    }
  ],
  "dataset1OutputFields": ["Address", "ProductName"]
}
```

## Advanced Configuration

### JSON Path Expressions

- Dot notation: `"product.details.name"`
- Array search: `"Attributes[Name=Color].Value"`
- Nested arrays/objects for complex structures

#### Complex Nested Structures

```json
{
  "ProductAttributes": [
    { "Type": "Color", "Value": "Red" },
    { "Type": "Size", "Value": "Large" },
    { "Type": "Material", "Value": "Cotton" }
  ],
  "Details": {
    "Pricing": { "MSRP": 29.99, "Sale": 19.99 },
    "Specifications": { "Weight": "2.5 lbs" }
  }
}
```

Corresponding JSON paths:

- Color: `"ProductAttributes[Type=Color].Value"`
- Size: `"ProductAttributes[Type=Size].Value"`
- MSRP: `"Details.Pricing.MSRP"`
- Weight: `"Details.Specifications.Weight"`

### Regular Expression Patterns

- Size cleaning (remove non-digits): `{"regex": "\\D"}`
- Model normalization (keep alphanumeric): `{"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}`
- Price extraction (strip currency symbols): `{"regex": "[^0-9.]"}`

#### Size Normalization

```json
{
  "name": "size",
  "regex": "\\D",
  "normalizationRegex": "[^0-9XLS]",
  "normalizationReplacement": ""
}
```

- `regex`: removes all non-digit characters during preprocessing
- `normalizationRegex`: keeps only digits and the letters X, L, S for the similarity calculation
#### Model Number Cleaning

```json
{
  "name": "model",
  "regex": "\\b(model|version|v\\d+)\\b",
  "normalizationRegex": "[^a-zA-Z0-9]",
  "normalizationReplacement": ""
}
```

- Removes common model prefixes
- Normalizes to alphanumeric characters only for comparison

#### Price Extraction

```json
{
  "name": "price",
  "regex": "[^0-9.]",
  "normalizationRegex": "\\$|,",
  "normalizationReplacement": ""
}
```

- Extracts numeric price values
- Removes currency symbols and commas

#### Brand Standardization

```json
{
  "name": "brand",
  "regex": "\\b(inc|llc|ltd|corp|company)\\b",
  "wordReplacements": {
    "apple": "apple inc",
    "hp": "hewlett packard",
    "ms": "microsoft"
  }
}
```

### Performance Optimization

- Grouping by an attribute reduces the N×M comparison matrix to smaller subsets
- Note: if the group-by field comes from nested JSON, it must also be included in `attributes`
- Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
- Limit `maxMatches` for large catalogs
- Disable matching (`useForMatching: false`) on grouping fields

#### Grouping Strategy

Use `groupByAttribute` to partition products into smaller groups:

```json
{
  "groupByAttribute": "category",
  "attributes": [
    { "name": "category", "weight": 0.5, "useForMatching": false }
  ]
}
```

Benefits:

- Reduces the comparison matrix from N×M to smaller subsets
- Significantly improves processing speed for large datasets
- Produces more accurate matches within similar product categories

#### Language Model Selection

Choose the appropriate model for your data:

- English (`"en"`): fastest, best for English-only data
- Multilingual (`"multilingual"`): slower, but handles mixed languages
- Specific languages (`"es"`, `"fr"`, `"de"`): optimized for those languages

## Output Format

The Actor generates matches with the following structure:

```json
{
  "Dataset1ProductId": "PROD123",
  "Dataset2ProductId": "SKU456",
  "overallSimilarity": 0.85,
  "titleSimilarity": 0.92,
  "brandSimilarity": 1.0,
  "colorSimilarity": 0.75,
  "Dataset1Title": "Apple iPhone 13 Pro",
  "Dataset2Title": "iPhone 13 Pro - Apple",
  "Dataset1Brand": "Apple",
  "Dataset2Brand": "Apple Inc"
}
```
"Dataset2Title": "iPhone 13 Pro - Apple", "Dataset1Brand": "Apple", "Dataset2Brand": "Apple Inc" } ### Reading the SUMMARY After execution, a SUMMARY record is saved to KeyValueStore containing: - Total products per dataset - Number of matches and unique matches - Match rate - Model and data format used - Any collected errors with type, code, message, and suggestions Review this summary to diagnose configuration or data issues quickly. ## Best Practices - Attribute Weighting: - High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs) - Medium Weight (0.8-1.2): Important descriptors (brand, title) - Low Weight (0.3-0.7): Secondary attributes (color, price) - Threshold Selection: - High Precision (0.8-0.9): Few false positives, may miss some matches - Balanced (0.6-0.8): Good balance of precision and recall - High Recall (0.4-0.6): Catches more matches, requires manual review - Text Preprocessing: 1. Start with simple wordReplacements 2. Add regex for cleaning patterns 3. Use normalizationRegex only for similarity calculation 4. Validate on sample data - Scaling to Large Datasets: - Always use groupByAttribute when > 10,000 items - Adjust maxMatches and disable output of original values to reduce output dataset size ## Troubleshooting \& Error Handling ### Common Issues - No matches found - Lower the threshold value - Verify attribute names and JSON paths - Adjust text preprocessing rules - Too many false positives - Increase threshold to 0.8–0.9 - Add stricter wordsToRemove or regex - Increase weights for unique identifiers - Performance bottlenecks - Enable groupByAttribute for large datasets - Use the English model for English-only data - Reduce maxMatches ### Error Types This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY. 
| Error Class | Code | Description |
| :-- | :-- | :-- |
| InputValidationError | PME-100 | Schema or type validation failed for Actor input |
| DataLoadingError | PME-200 | CSV/JSON file not found, unreadable, or unparseable |
| AttributeConfigError | PME-300 | Issues in the `attributes` section (missing columns, bad JSON paths, invalid weights) |
| ModelLoadingError | PME-400 | Sentence Transformer model fetch or cache failure |
| ProcessingError | PME-500 | Failures during the matching workflow (e.g., zero vectors, similarity computation errors) |
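Downstream automation can scan the errors collected in the SUMMARY record, for example to fail fast on configuration problems before retrying a run. The field names in this sketch (`errors`, `type`, `code`, `message`, `suggestions`) are assumptions based on the description above, not a documented schema:

```python
# Hypothetical SUMMARY fragment; the actual field names may differ.
summary = {
    "matchRate": 0.41,
    "errors": [
        {
            "type": "AttributeConfigError",
            "code": "PME-300",
            "message": "Unknown column 'Brandd' in dataset 1",
            "suggestions": ["Check attribute names against the CSV header"],
        },
    ],
}

# Input and attribute configuration errors (PME-1xx / PME-3xx) are worth
# fixing before re-running; data or model errors may be transient.
config_errors = [
    e for e in summary["errors"]
    if e["code"].startswith(("PME-1", "PME-3"))
]
for err in config_errors:
    print(f"{err['code']} {err['type']}: {err['message']}")
```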

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Advanced Product Matcher Pro now on Apify. Free tier available with no credit card required.

Start Free Trial

Actor Information

Developer
datawhisperers
Pricing
Paid
Total Runs
26
Active Users
2
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
