Product Matching Vectorizer

Name: Product Matching Vectorizer
Author: tri_angle


1 user


About Product Matching Vectorizer

Builds a FAISS vector database from products in an Apify dataset using an ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search. After uploading your dataset to the vector database, use our E-commerce Product Matching Tool to find matching products.

What does this actor do?

Product Matching Vectorizer is an actor that runs in the cloud on the Apify platform. It reads products from an Apify dataset, embeds them with an ONNX model, and saves a FAISS index to a Key-Value Store so that matching products can be found by similarity search.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

# Product Matching Vectorizer - Apify Actor

Builds a FAISS vector database from products in an Apify dataset using a fine-tuned ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search.

## Overview

This actor:

1. Loads products from an Apify dataset (using pagination)
2. Extracts fields using flexible dot-notation mapping
3. Generates 384-dimensional embeddings using a fine-tuned ONNX model
4. Builds a FAISS index for similarity search
5. Saves index + metadata to a named Key-Value Store

Key Features:

- Flexible mapping: Extract nested fields with dot notation (e.g., `product.name.translated`)
- Metadata options: Store full items or selective fields
- Migration recovery: Automatic checkpoint and resume on server migration
- Batch processing: Efficient vectorization in configurable batches
- Progress tracking: Real-time status updates with ETA

## Required vs Optional Fields

The product matching model uses exactly 5 fields for generating embeddings: 3 required and 2 optional.

### Required Fields

These fields MUST be provided for every product:

- `titlePath` - Product name/title
- `brandPath` - Brand name
- `categoryPath` - Product category

### Optional Fields

These fields improve matching quality when available:

- `descriptionPath` - Product description (highly recommended for all products)
- `specificationsPath` - Technical specifications (highly recommended for technical products)

When to use specifications:

- Electronics - Screen size, processor, RAM, storage, etc.
- Appliances - Dimensions, power, capacity, features
- Technical gear - Materials, measurements, technical details

When to skip specifications:

- Fashion/Clothing - Size/color go in metadata, not embeddings
- Simple products - Most non-technical products don't need this

Important Notes:

- The model does NOT use other fields like price, SKU, color, size, etc. for similarity matching
- Additional fields can be stored in `metadataMapping` for retrieval, but won't affect matching
- Missing required fields will generate warnings but won't stop processing
- Set optional fields to `null` or omit them if not applicable to your products

## Input Parameters

The actor accepts the following input parameters:

```json
{
  "datasetId": "bp0kO9SGUQckUnDJb",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": null,
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "price": "product.options.{first}.retail_price_cents"
  },
  "kvStoreName": "customer-products-index",
  "maxItems": null,
  "batchSize": 1000
}
```

### Required Parameters

- `datasetId` (string): Apify dataset ID containing products to vectorize
- `idField` (string): Dot-notation path to product ID field
  - Example: `"product.token"`, `"id"`, `"sku"`
- `titlePath` (string): Path to product title field
- `brandPath` (string): Path to brand name field
- `categoryPath` (string): Path to category field
- `kvStoreName` (string): Name of Key-Value Store to save the index

### Optional Parameters

- `descriptionPath` (string, optional): Path to product description field
  - Highly recommended - significantly improves matching quality
- `specificationsPath` (string, optional): Path to technical specifications
  - Recommended for technical products (electronics, appliances, etc.)
- `metadataMapping` (object, optional): Fields to store as metadata
  - If not specified: Full dataset items are stored (preserves all data)
  - If specified: Only mapped fields are stored (compact, optimized)
  - Can include any fields (price, SKU, images, etc.) for retrieval
- `maxItems` (integer, optional): Limit number of products (useful for testing)
- `batchSize` (integer, default: 1000): Products to encode per batch
  - Larger batches = faster but more memory
  - Range: 1-10,000
- `debugMode` (boolean, default: false): Enable verbose debug logging
  - When enabled: Shows detailed data structure information and extraction paths
  - When disabled (recommended): Cleaner production logs with better security
  - ⚠️ Warning: Debug mode may expose internal data structures in logs

## Dot Notation Mapping

### Basic Syntax

Extract nested fields using dot notation:

```json
{
  "title": "product.name.translated",
  "brand": "brand.name",
  "category": "product.details.taxonomy_type.group_name"
}
```

Given this dataset item:

```json
{
  "product": {
    "name": {"translated": "Canvas Tote Bag"},
    "details": {
      "taxonomy_type": {"group_name": "Bags & Totes"}
    }
  },
  "brand": {"name": "EcoBrand"}
}
```

Extracts:

```json
{
  "title": "Canvas Tote Bag",
  "brand": "EcoBrand",
  "category": "Bags & Totes"
}
```

### Special Syntax: `{first}`

Use `{first}` to select the value under the first key of a dictionary:

```json
{
  "price": "product.options.{first}.retail_price_cents"
}
```

Given:

```json
{
  "product": {
    "options": {
      "opt_abc123": {"retail_price_cents": 2499},
      "opt_def456": {"retail_price_cents": 3499}
    }
  }
}
```

Extracts: `2499` (from first option)

### Null Values

Set a field to `null` to explicitly omit it:

```json
{
  "title": "product.name",
  "specifications": null
}
```

## Embedding Model

The actor uses a fine-tuned sentence transformer model optimized for product matching:

- Base model: `sentence-transformers/all-MiniLM-L6-v2`
- Fine-tuned: On product matching task
- Format: ONNX for fast CPU inference
- Embedding dimension: 384
- Normalization: L2-normalized (cosine similarity via inner product)

### Embedding Format

Products are formatted before encoding:

```
title: {title} | brand: {brand} | category: {category} | desc: {description} | spec: {specifications}
```

Important: Price is NOT included in embeddings (it's metadata only).

## Output Format

The actor saves two files to the specified Key-Value Store:

### 1. `index.faiss` (Binary)

FAISS IndexFlatIP (Inner Product) containing normalized embeddings.

- Type: Inner Product index (cosine similarity for normalized vectors)
- Usage: Load with `faiss.deserialize_index`, passing the file's bytes as a `uint8` NumPy array

### 2. `metadata.json` (JSON)

Complete metadata about the index:

```json
{
  "version": "1.0",
  "created_at": "2025-10-30T19:00:00Z",
  "total_products": 104321,
  "embedding_dim": 384,
  "model": "product-matcher-onnx",
  "embedding_mapping": {
    "title": "product.name.translated",
    "brand": "brand.name"
  },
  "metadata_mapping": {
    "title": "product.name.translated",
    "price": "product.options.{first}.retail_price_cents"
  },
  "ids": ["p_123", "p_456", ...],
  "metadata": [
    {"title": "Canvas Tote", "price": 2499},
    {"title": "Water Bottle", "price": 1999},
    ...
  ]
}
```

Fields:

- `version`: Metadata schema version
- `created_at`: UTC timestamp
- `total_products`: Number of products in index
- `embedding_dim`: Vector dimension (384)
- `model`: Model identifier
- `embedding_mapping`: Mapping used for embeddings
- `metadata_mapping`: Mapping used for metadata (or `null` if full items)
- `ids`: Array of product IDs (same order as FAISS index)
- `metadata`: Array of product metadata (same order as FAISS index)

## Usage Examples

### Example 1: Minimal Configuration (Full Metadata)

Store full dataset items as metadata, required fields only:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "category.name",
  "kvStoreName": "products-full"
}
```

Result: `metadata.json` contains full dataset items (preserves all data).
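The path values in these configurations (such as `product.name.translated` and `product.options.{first}.retail_price_cents`) follow the dot-notation rules from the Dot Notation Mapping section. As a rough illustration of those rules, here is a minimal sketch of a resolver. The function name `extract_path` is hypothetical, and the choice to return `None` for missing keys is an assumption; the actor's own `src/mapping.py` may behave differently.

```python
from typing import Any, Optional


def extract_path(item: Any, path: Optional[str]) -> Any:
    """Resolve a dot-notation path against nested dicts.

    Hypothetical helper: returns None when the path is null or any
    segment is missing (an assumption, not the actor's documented behavior).
    """
    if path is None:
        return None
    current = item
    for part in path.split("."):
        if not isinstance(current, dict) or not current:
            return None
        if part == "{first}":
            # Take the value stored under the first key (insertion order).
            current = next(iter(current.values()))
        else:
            current = current.get(part)
        if current is None:
            return None
    return current


item = {
    "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "options": {
            "opt_abc123": {"retail_price_cents": 2499},
            "opt_def456": {"retail_price_cents": 3499},
        },
    },
    "brand": {"name": "EcoBrand"},
}

title = extract_path(item, "product.name.translated")
price = extract_path(item, "product.options.{first}.retail_price_cents")
missing = extract_path(item, "product.details.description")
print(title, price, missing)
```

Returning `None` instead of raising keeps a single malformed item from aborting a whole batch, which matches the documented "warn but keep processing" behavior for missing fields.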
### Example 2: Complete Configuration with Descriptions

Include all fields for best matching quality:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": "product.specifications",
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
    "price": "product.options.{first}.retail_price_cents",
    "image": "product.image_url"
  },
  "kvStoreName": "products-complete"
}
```

Result: Best embedding quality + compact metadata with price and image.

### Example 3: Products Without Specifications

For non-technical products (clothing, home goods, etc.) that don't have specifications:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.category",
  "descriptionPath": "product.description",
  "specificationsPath": null,
  "kvStoreName": "fashion-products"
}
```

Note: `specificationsPath` is set to `null` (or omitted) since fashion products typically don't have technical specs.
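To make the effect of a null `specificationsPath` more concrete, the sketch below builds the text that would be embedded, following the documented `title: ... | brand: ... | category: ... | desc: ... | spec: ...` format. The helper `format_product` is hypothetical, and leaving out empty fields entirely is an assumption: the README does not say whether the actor drops the segment or emits an empty one.

```python
# (label in the embedding text, key in the extracted product)
FIELDS = [
    ("title", "title"),
    ("brand", "brand"),
    ("category", "category"),
    ("desc", "description"),
    ("spec", "specifications"),
]


def format_product(product: dict) -> str:
    """Join the five embedding fields into one string (hypothetical helper)."""
    parts = []
    for label, key in FIELDS:
        value = product.get(key)
        if value:  # assumption: missing or empty fields are skipped
            parts.append(f"{label}: {value}")
    return " | ".join(parts)


text = format_product({
    "title": "Linen Shirt",
    "brand": "EcoBrand",
    "category": "Shirts",
    "description": "Relaxed fit",
    "specifications": None,  # as in Example 3
    "price": 2499,           # ignored: price is metadata only
})
print(text)
```

Note that `price` never reaches the text, which is consistent with the statement that price is metadata only.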
### Example 4: Testing with Limit

Test with 100 products (minimal setup):

```json
{
  "datasetId": "abc123",
  "idField": "id",
  "titlePath": "name",
  "brandPath": "brand",
  "categoryPath": "category",
  "kvStoreName": "test-index",
  "maxItems": 100,
  "batchSize": 50
}
```

## Using the Generated Index

### Python Example

```python
from apify_client import ApifyClient
import faiss
import numpy as np

# Initialize Apify client
client = ApifyClient("YOUR_API_TOKEN")

# Get KV store. key_value_store() expects a store ID or "username~store-name",
# so replace YOUR_USERNAME with the account that owns the named store.
kv_store = client.key_value_store("YOUR_USERNAME~customer-products-index")

# Load index (raw bytes -> uint8 array -> FAISS index)
index_bytes = kv_store.get_record("index.faiss")["value"]
index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))

# Load metadata
metadata = kv_store.get_record("metadata.json")["value"]
ids = metadata["ids"]
product_metadata = metadata["metadata"]

print(f"Loaded index with {index.ntotal} products")

# Search example (assuming you have a query embedding)
query_embedding = ...  # 384-dim vector, L2-normalized
k = 5  # Top 5 results

similarities, indices = index.search(
    query_embedding.reshape(1, -1).astype('float32'), k
)

# Get results
for rank, (sim, idx) in enumerate(zip(similarities[0], indices[0])):
    if idx < 0:  # FAISS pads with -1 when the index holds fewer than k vectors
        continue
    product_id = ids[idx]
    meta = product_metadata[idx]
    print(f"{rank+1}. {meta['title']} (similarity: {sim:.3f})")
```

## Migration Recovery

The actor automatically handles server migrations:

1. State Persistence: Progress is saved on `PERSIST_STATE` events
2. Batch Checkpoints: In-progress batches are saved before migration
3. Auto-Resume: On restart, actor resumes from last checkpoint
4. No Data Loss: All processed embeddings are preserved

State stored in default KV store:

- `vectorizer-state`: Progress tracking
- `vectorizer-batch-checkpoint`: In-progress batch data

These are automatically cleaned up on successful completion.
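The checkpoint-and-resume idea can be illustrated with a small, self-contained sketch. It is not the actor's implementation: a plain dict stands in for the default Key-Value Store, the state layout (`processed`, `vectors`) and the `fail_after` switch are invented for the demonstration, and only the key name `vectorizer-state` comes from the list above.

```python
from typing import Callable, Dict, List, Optional

STATE_KEY = "vectorizer-state"  # key name taken from the README


def run_batches(
    items: List[str],
    batch_size: int,
    store: Dict[str, dict],  # dict standing in for the default KV store
    encode: Callable[[List[str]], List[int]],
    fail_after: Optional[int] = None,  # simulate a migration after N batches
) -> List[int]:
    # Resume from the saved state if one exists, otherwise start fresh.
    state = store.get(STATE_KEY, {"processed": 0, "vectors": []})
    processed = state["processed"]
    vectors = list(state["vectors"])
    batches_done = 0
    while processed < len(items):
        if fail_after is not None and batches_done == fail_after:
            raise RuntimeError("simulated migration")
        batch = items[processed:processed + batch_size]
        vectors.extend(encode(batch))
        processed += len(batch)
        batches_done += 1
        # Persist progress after every completed batch.
        store[STATE_KEY] = {"processed": processed, "vectors": list(vectors)}
    store.pop(STATE_KEY, None)  # clean up on successful completion
    return vectors


store: Dict[str, dict] = {}
items = [f"product {i}" for i in range(10)]
encode = lambda batch: [len(text) for text in batch]  # fake "embedding"

try:
    run_batches(items, 4, store, encode, fail_after=2)
except RuntimeError:
    pass  # the run was interrupted after two batches

resumed_from = store[STATE_KEY]["processed"]
vectors = run_batches(items, 4, store, encode)
print(resumed_from, len(vectors))
```

The second call picks up at the saved offset instead of re-encoding the first eight items, and the state key is removed once the run completes.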
## Performance

The actor is optimized for efficient processing:

- Fast model loading and initialization
- Efficient batch vectorization
- Quick FAISS index building
- Memory usage scales with batch size

Optimization tips:

- Increase `batchSize` for faster processing (up to 10,000)
- Use selective `metadataMapping` to reduce memory usage
- For very large datasets (>1M products), consider chunking

## Files

### Core Actor Files

- `src/main.py` - Main actor entry point with migration recovery
- `src/vectorizer.py` - ONNX vectorizer wrapper
- `src/mapping.py` - Dot-notation field extraction
- `src/preprocessing.py` - Text preprocessing utilities

### Configuration

- `.actor/actor.json` - Actor metadata
- `.actor/input_schema.json` - Input parameter schema
- `Dockerfile` - Container definition
- `requirements.txt` - Python dependencies

### Model Files

- `models/product-matcher-onnx/` - ONNX model files
  - `model.onnx` - Optimized inference model
  - `tokenizer.json` - Tokenizer configuration
  - `pooling_config.json` - Pooling configuration

## Deployment

1. Configure input: Set dataset ID, mappings, and KV store name
2. Run actor: Via Apify Console or API
3. Monitor progress: Real-time status updates with ETA
4. Retrieve index: Access from specified KV store

Deploy to Apify:

```bash
apify push
```

## Troubleshooting

### Missing ID Field

Error: `Missing or empty ID field at path: product.token`

Solution: Check that `idField` path is correct and all items have IDs.

### Empty Dataset

Warning: `No items found in dataset`

Solution: Verify dataset ID and that it contains items.

### Invalid Mapping

Error: `embeddingMapping cannot be empty`

Solution: Provide at least one field in `embeddingMapping`.

### Memory Issues

Error: Out of memory during batch processing

Solution: Reduce `batchSize` (try 500 or 250).
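Some of these problems can be caught before a run starts. The following sketch checks an input object against the required parameters and the `batchSize` range documented in the Input Parameters section. `validate_input` is a hypothetical helper, not part of the actor, and the exact messages are made up for illustration.

```python
from typing import List

REQUIRED = [
    "datasetId", "idField", "titlePath",
    "brandPath", "categoryPath", "kvStoreName",
]


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_input(actor_input: dict) -> List[str]:
    """Return a list of human-readable problems (empty if the input looks fine)."""
    problems = []
    for name in REQUIRED:
        value = actor_input.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name} is required and must be a non-empty string")
    batch_size = actor_input.get("batchSize", 1000)  # documented default
    if not _is_int(batch_size) or not 1 <= batch_size <= 10_000:
        problems.append("batchSize must be an integer between 1 and 10,000")
    max_items = actor_input.get("maxItems")
    if max_items is not None and (not _is_int(max_items) or max_items < 1):
        problems.append("maxItems must be a positive integer or null")
    return problems


good = {
    "datasetId": "abc123", "idField": "id", "titlePath": "name",
    "brandPath": "brand", "categoryPath": "category",
    "kvStoreName": "test-index", "maxItems": 100, "batchSize": 50,
}
bad = {"datasetId": "abc123", "idField": "", "batchSize": 50_000}

good_problems = validate_input(good)
bad_problems = validate_input(bad)
print(good_problems)
print(bad_problems)
```

Running a check like this locally is cheaper than discovering a missing path halfway through a large dataset.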
## Related Actors

- Product Matcher: Uses this index to find similar products
- Product Scraper: Collects products for indexing

## Support

For issues or questions, please create an issue in the repository.

## License

MIT

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources


Actor Information

Developer: tri_angle
Pricing: Paid
Active Users: 1


