Advanced Product Matcher Pro

by datawhisperers

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring.

26 runs
2 users

About Advanced Product Matcher Pro

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

What does this actor do?

Advanced Product Matcher Pro is an automation tool (an Actor) that runs on the Apify platform. It matches and reconciles product data between two datasets in the cloud, so you can run large matching jobs without any local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

# AI Product Matcher Actor

A powerful Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

## Features

- **Multi-format Support**: Works with both CSV files (KeyValueStore) and JSON datasets
- **Flexible Data Sources**: Load data directly from Apify Datasets or the KeyValueStore
- **Intelligent Matching**: Uses Sentence Transformers and cosine similarity for semantic product matching
- **Configurable Attributes**: Weight different product attributes based on importance
- **Text Preprocessing**: Built-in word removal, replacement, regex cleaning, and normalization
- **Performance Optimization**: Group products by categories or other attributes for faster processing
- **Multilingual Support**: English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
- **Flexible Output**: Customizable match results with similarity scores, original values, and additional output fields
- **Error Reporting**: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors

## Quick Start

### Basic Configuration Example

```json
{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "catalog_products",
  "dataset1Name": "Catalog",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_products",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.7,
  "maxMatches": 2,
  "language": "en",
  "groupByAttribute": "category",
  "csvSeparator": ",",
  "includeOriginalValues": true,
  "attributes": [
    { "name": "title", "weight": 1.0, "useForMatching": true },
    { "name": "brand", "weight": 0.8, "useForMatching": true },
    { "name": "price", "weight": 0.3, "useForMatching": false }
  ]
}
```

### Core Input Parameters

| Parameter | Type | Description | Default |
| :-- | :-- | :-- | :-- |
| `dataFormat` | string | Data format: `"csv"` or `"json"` | `"json"` |
| `dataSource` | string | Source type: `"datasets"` or `"keyvaluestore"` | `"datasets"` |
| `keyValuestoreNameOrId` | string | Name or ID of the KeyValueStore (when `dataSource` is `keyvaluestore`) | none |
| `dataset1` | string | First dataset key/ID (CSV filename or Dataset ID) | required |
| `dataset1Name` | string | Friendly name for dataset 1 | `"Dataset1"` |
| `dataset1PrimaryKey` | string | Primary key field name in dataset 1 | `"ProductId"` |
| `dataset2` | string | Second dataset key/ID | required |
| `dataset2Name` | string | Friendly name for dataset 2 | `"Dataset2"` |
| `dataset2PrimaryKey` | string | Primary key field name in dataset 2 | `"ProductId"` |
| `threshold` | number | Minimum overall similarity score for matches (0.0–1.0) | 0.5 |
| `maxMatches` | integer | Maximum number of matches returned per item | 2 |
| `language` | string | Embedding model selection: `"en"`, `"multilingual"`, `"es"`, `"fr"`, `"de"`, `"it"`, `"pt"`, `"nl"` | `"en"` |
| `groupByAttribute` | string | Attribute name to group by for efficient matching (optional) | none |
| `csvSeparator` | string | CSV delimiter (only when `dataFormat` is `csv`) | `","` |
| `includeOriginalValues` | boolean | Include original attribute values in the output records | true |
| `dataset1OutputFields` | array | Specific attribute values from dataset 1 to include in the output records | `["Field1"]` |
| `dataset2OutputFields` | array | Specific attribute values from dataset 2 to include in the output records | `["Field1", "Field2"]` |
| `attributes` | array | **Required.** List of attribute configurations (see below) | required |

### Attribute Configuration

Each attribute in `attributes` supports:

- `name` (string, required): Column name (CSV) or attribute key (JSON)
- `weight` (number): Importance weight for matching (higher = more important)
- `useForMatching` (boolean): Whether to include in the similarity calculation
- `jsonPath` (string): JSON path expression for nested data
- `wordsToRemove` (array): List of words to strip before matching
- `wordReplacements` (object): Mapping of terms to replace prior to matching
- `regex` (string): Regex to apply during preprocessing
- `normalizationRegex` (string): Regex applied before the similarity calculation
- `normalizationReplacement` (string): Replacement for the normalization regex

#### Text Preprocessing Example

```json
{
  "name": "brand",
  "weight": 0.8,
  "useForMatching": true,
  "wordsToRemove": ["inc", "llc", "ltd", "corp"],
  "wordReplacements": {
    "apple": "apple inc",
    "samsung": "samsung electronics"
  },
  "regex": "\\b(inc|llc|ltd|corp)\\b",
  "normalizationRegex": "[^a-zA-Z0-9\\s]",
  "normalizationReplacement": ""
}
```

| Property | Type | Description |
|----------|------|-------------|
| `wordsToRemove` | array | Words to remove from text |
| `wordReplacements` | object | Word substitution mapping |
| `regex` | string | Regex pattern for text cleaning |
| `normalizationRegex` | string | Regex for similarity-calculation normalization |
| `normalizationReplacement` | string | Replacement for the normalization regex |
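The preprocessing rules above (word replacement, word removal, and regex cleaning) compose into a simple text pipeline. The `preprocess` helper below is a hypothetical Python sketch of how such rules could combine, not the Actor's actual implementation:

```python
import re


def preprocess(text, words_to_remove=(), word_replacements=None, regex=None):
    """Hypothetical sketch of a wordsToRemove/wordReplacements/regex pipeline."""
    result = text.lower()
    # Apply word replacements first, e.g. {"samsung": "samsung electronics"}
    for old, new in (word_replacements or {}).items():
        result = re.sub(rf"\b{re.escape(old)}\b", new, result)
    # Strip noise words such as legal suffixes ("inc", "llc", ...)
    for word in words_to_remove:
        result = re.sub(rf"\b{re.escape(word)}\b", "", result)
    # Optional cleaning regex applied last
    if regex:
        result = re.sub(regex, "", result)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", result).strip()


print(preprocess("Samsung Galaxy S21",
                 word_replacements={"samsung": "samsung electronics"}))
# samsung electronics galaxy s21
print(preprocess("Apple Inc.", words_to_remove=["inc"], regex="[^a-z0-9 ]"))
# apple
```

The sketch lowercases input and applies replacements before removals; whether the Actor uses the same order is an assumption.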
## Real-World Examples

### 1. E-commerce Catalog Matching

```json
{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "manufacturer_catalog.csv",
  "dataset1Name": "Manufacturer",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_inventory.csv",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.75,
  "maxMatches": 3,
  "language": "en",
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "product_name",
      "weight": 1.5,
      "useForMatching": true,
      "wordsToRemove": ["new", "original", "authentic"],
      "wordReplacements": { "&": "and", "w/": "with" }
    },
    {
      "name": "brand",
      "weight": 1.2,
      "useForMatching": true,
      "wordsToRemove": ["inc", "llc", "corp"],
      "wordReplacements": { "apple": "apple inc", "hp": "hewlett packard" }
    },
    {
      "name": "model_number",
      "weight": 1.8,
      "useForMatching": true,
      "normalizationRegex": "[^A-Za-z0-9]",
      "normalizationReplacement": ""
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false,
      "regex": "\\D"
    }
  ]
}
```

### 2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

```json
{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "fashion_supplier_a",
  "dataset1Name": "SupplierA",
  "dataset1PrimaryKey": "ID",
  "dataset2": "fashion_supplier_b",
  "dataset2Name": "SupplierB",
  "dataset2PrimaryKey": "ID",
  "threshold": 0.65,
  "language": "multilingual",
  "maxMatches": 2,
  "attributes": [
    {
      "name": "Color",
      "jsonPath": "ProductAttributes[Type=Color].Value",
      "weight": 1.5,
      "useForMatching": true,
      "wordReplacements": { "gray": "grey", "navy": "navy blue" }
    },
    {
      "name": "Size",
      "jsonPath": "ProductAttributes[Type=Size].Value",
      "weight": 1.8,
      "useForMatching": true,
      "wordsToRemove": ["size", "us", "eu"],
      "normalizationRegex": "[^0-9XLS]",
      "normalizationReplacement": ""
    },
    {
      "name": "Material",
      "jsonPath": "Details.Fabric.Primary",
      "weight": 1.2,
      "useForMatching": true
    }
  ],
  "includeOriginalValues": false
}
```

### 3. Home & Garden Products

```json
{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "bedbath",
  "dataset1Name": "BedBath",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "overstock",
  "dataset2Name": "Overstock",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.6,
  "language": "en",
  "groupByAttribute": "Model",
  "maxMatches": 3,
  "attributes": [
    {
      "name": "Model",
      "jsonPath": "AdhocDataAttributes[Name=Model].value",
      "weight": 1,
      "useForMatching": false
    },
    {
      "name": "Color",
      "jsonPath": "AdhocDataAttributes[Name=Color].value",
      "weight": 2,
      "useForMatching": true,
      "wordReplacements": { "gray": "grey", "/": " " }
    },
    {
      "name": "Size",
      "jsonPath": "AdhocDataAttributes[Name=Size].value",
      "weight": 3,
      "useForMatching": true,
      "regex": "\\D"
    },
    {
      "name": "Shape",
      "jsonPath": "AdhocDataAttributes[Name=Shape].value",
      "weight": 1,
      "useForMatching": true
    }
  ],
  "dataset1OutputFields": ["Address", "ProductName"]
}
```

## Advanced Configuration

### JSON Path Expressions

- Dot notation: `"product.details.name"`
- Array search: `"Attributes[Name=Color].Value"`
- Nested arrays/objects for complex structures

#### Complex Nested Structures

```json
{
  "ProductAttributes": [
    { "Type": "Color", "Value": "Red" },
    { "Type": "Size", "Value": "Large" },
    { "Type": "Material", "Value": "Cotton" }
  ],
  "Details": {
    "Pricing": { "MSRP": 29.99, "Sale": 19.99 },
    "Specifications": { "Weight": "2.5 lbs" }
  }
}
```

Corresponding JSON paths:

- Color: `"ProductAttributes[Type=Color].Value"`
- Size: `"ProductAttributes[Type=Size].Value"`
- MSRP: `"Details.Pricing.MSRP"`
- Weight: `"Details.Specifications.Weight"`

### Regular Expression Patterns

- Size cleaning (remove non-digits): `{"regex": "\\D"}`
- Model normalization (keep alphanumeric): `{"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}`
- Price extraction (strip currency symbols): `{"regex": "[^0-9.]"}`

#### Size Normalization

```json
{
  "name": "size",
  "regex": "\\D",
  "normalizationRegex": "[^0-9XLS]",
  "normalizationReplacement": ""
}
```

- `regex`: removes all non-digit characters during preprocessing
- `normalizationRegex`: keeps only digits and the letters X, L, S for the similarity calculation
#### Model Number Cleaning

```json
{
  "name": "model",
  "regex": "\\b(model|version|v\\d+)\\b",
  "normalizationRegex": "[^a-zA-Z0-9]",
  "normalizationReplacement": ""
}
```

- Removes common model prefixes
- Normalizes to alphanumeric characters only for comparison

#### Price Extraction

```json
{
  "name": "price",
  "regex": "[^0-9.]",
  "normalizationRegex": "\\$|,",
  "normalizationReplacement": ""
}
```

- Extracts numeric price values
- Removes currency symbols and commas

#### Brand Standardization

```json
{
  "name": "brand",
  "regex": "\\b(inc|llc|ltd|corp|company)\\b",
  "wordReplacements": {
    "apple": "apple inc",
    "hp": "hewlett packard",
    "ms": "microsoft"
  }
}
```

### Performance Optimization

- Grouping by an attribute reduces the N×M comparison matrix to smaller subsets
- Note: if the group-by field comes from nested JSON, it must also be included in `attributes`
- Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
- Limit `maxMatches` for large catalogs
- Disable matching (`useForMatching: false`) on grouping fields

#### Grouping Strategy

Use `groupByAttribute` to partition products into smaller groups:

```json
{
  "groupByAttribute": "category",
  "attributes": [
    { "name": "category", "weight": 0.5, "useForMatching": false }
  ]
}
```

Benefits:

- Reduces the comparison matrix from N×M to smaller subsets
- Significantly improves processing speed for large datasets
- Produces more accurate matches within similar product categories

#### Language Model Selection

Choose the appropriate model for your data:

- English (`"en"`): fastest, best for English-only data
- Multilingual (`"multilingual"`): slower, but handles mixed languages
- Specific languages (`"es"`, `"fr"`, `"de"`): optimized for those languages

## Output Format

The Actor generates matches with the following structure:

```json
{
  "Dataset1ProductId": "PROD123",
  "Dataset2ProductId": "SKU456",
  "overallSimilarity": 0.85,
  "titleSimilarity": 0.92,
  "brandSimilarity": 1.0,
  "colorSimilarity": 0.75,
  "Dataset1Title": "Apple iPhone 13 Pro",
  "Dataset2Title": "iPhone 13 Pro - Apple",
  "Dataset1Brand": "Apple",
  "Dataset2Brand": "Apple Inc"
}
```
"Dataset2Title": "iPhone 13 Pro - Apple", "Dataset1Brand": "Apple", "Dataset2Brand": "Apple Inc" } ### Reading the SUMMARY After execution, a SUMMARY record is saved to KeyValueStore containing: - Total products per dataset - Number of matches and unique matches - Match rate - Model and data format used - Any collected errors with type, code, message, and suggestions Review this summary to diagnose configuration or data issues quickly. ## Best Practices - Attribute Weighting: - High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs) - Medium Weight (0.8-1.2): Important descriptors (brand, title) - Low Weight (0.3-0.7): Secondary attributes (color, price) - Threshold Selection: - High Precision (0.8-0.9): Few false positives, may miss some matches - Balanced (0.6-0.8): Good balance of precision and recall - High Recall (0.4-0.6): Catches more matches, requires manual review - Text Preprocessing: 1. Start with simple wordReplacements 2. Add regex for cleaning patterns 3. Use normalizationRegex only for similarity calculation 4. Validate on sample data - Scaling to Large Datasets: - Always use groupByAttribute when > 10,000 items - Adjust maxMatches and disable output of original values to reduce output dataset size ## Troubleshooting \& Error Handling ### Common Issues - No matches found - Lower the threshold value - Verify attribute names and JSON paths - Adjust text preprocessing rules - Too many false positives - Increase threshold to 0.8–0.9 - Add stricter wordsToRemove or regex - Increase weights for unique identifiers - Performance bottlenecks - Enable groupByAttribute for large datasets - Use the English model for English-only data - Reduce maxMatches ### Error Types This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY. 
| Error Class | Code | Description |
| :-- | :-- | :-- |
| InputValidationError | PME-100 | Schema or type validation failed for Actor input |
| DataLoadingError | PME-200 | CSV/JSON file not found, unreadable, or unparseable |
| AttributeConfigError | PME-300 | Issues in the `attributes` section (missing columns, bad JSON paths, invalid weights) |
| ModelLoadingError | PME-400 | Sentence Transformer model fetch or cache failure |
| ProcessingError | PME-500 | Failures during the matching workflow (e.g., zero vectors, similarity computation errors) |
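Downstream automation can scan the errors collected in the SUMMARY record, for example to fail fast on configuration problems before retrying a run. The field names in this sketch (`errors`, `type`, `code`, `message`, `suggestions`) are assumptions based on the description above, not a documented schema:

```python
# Hypothetical SUMMARY fragment; the actual field names may differ.
summary = {
    "matchRate": 0.41,
    "errors": [
        {
            "type": "AttributeConfigError",
            "code": "PME-300",
            "message": "Unknown column 'Brandd' in dataset 1",
            "suggestions": ["Check attribute names against the CSV header"],
        },
    ],
}

# Input and attribute configuration errors (PME-1xx / PME-3xx) are worth
# fixing before re-running; data or model errors may be transient.
config_errors = [
    e for e in summary["errors"]
    if e["code"].startswith(("PME-1", "PME-3"))
]
for err in config_errors:
    print(f"{err['code']} {err['type']}: {err['message']}")
```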

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Advanced Product Matcher Pro now on Apify. Free tier available with no credit card required.

Start Free Trial

Actor Information

Developer
datawhisperers
Pricing
Paid
Total Runs
26
Active Users
2
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
