Product Matching Vectorizer
by tri_angle
About Product Matching Vectorizer
Builds a FAISS vector database from products in an Apify dataset using an ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search. After uploading your dataset to the vector database, use our E-commerce Product Matching Tool to find matching products.
What does this actor do?
Product Matching Vectorizer is an automation tool available on the Apify platform. It turns the products in an Apify dataset into a FAISS vector index that can be searched for similar products, and it runs entirely in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# Product Matching Vectorizer - Apify Actor

Builds a FAISS vector database from products in an Apify dataset using a fine-tuned ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search.

## Overview

This actor:

1. Loads products from an Apify dataset (using pagination)
2. Extracts fields using flexible dot-notation mapping
3. Generates 384-dimensional embeddings using fine-tuned ONNX model
4. Builds a FAISS index for similarity search
5. Saves index + metadata to a named Key-Value Store

Key Features:

- Flexible mapping: Extract nested fields with dot notation (e.g., `product.name.translated`)
- Metadata options: Store full items or selective fields
- Migration recovery: Automatic checkpoint and resume on server migration
- Batch processing: Efficient vectorization in configurable batches
- Progress tracking: Real-time status updates with ETA

## Required vs Optional Fields

The product matching model uses exactly 5 fields for generating embeddings:

### Required Fields

These fields MUST be provided for every product:

- `titlePath` - Product name/title
- `brandPath` - Brand name
- `categoryPath` - Product category

### Optional Fields

These fields improve matching quality when available:

- `descriptionPath` - Product description (highly recommended for all products)
- `specificationsPath` - Technical specifications (highly recommended for technical products)

When to use specifications:

- Electronics - Screen size, processor, RAM, storage, etc.
- Appliances - Dimensions, power, capacity, features
- Technical gear - Materials, measurements, technical details
- Fashion/Clothing - Size/color go in metadata, not embeddings
- Simple products - Most non-technical products don't need this

Important Notes:

- The model does NOT use other fields like price, SKU, color, size, etc.
for similarity matching
- Additional fields can be stored in `metadataMapping` for retrieval, but won't affect matching
- Missing required fields will generate warnings but won't stop processing
- Set optional fields to `null` or omit them if not applicable to your products

## Input Parameters

The actor accepts the following input parameters:

```json
{
  "datasetId": "bp0kO9SGUQckUnDJb",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": null,
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "price": "product.options.{first}.retail_price_cents"
  },
  "kvStoreName": "customer-products-index",
  "maxItems": null,
  "batchSize": 1000
}
```

### Required Parameters

- `datasetId` (string): Apify dataset ID containing products to vectorize
- `idField` (string): Dot-notation path to product ID field
  - Example: `"product.token"`, `"id"`, `"sku"`
- `titlePath` (string): Path to product title field
- `brandPath` (string): Path to brand name field
- `categoryPath` (string): Path to category field
- `kvStoreName` (string): Name of Key-Value Store to save the index

### Optional Parameters

- `descriptionPath` (string, optional): Path to product description field
  - Highly recommended - significantly improves matching quality
- `specificationsPath` (string, optional): Path to technical specifications
  - Recommended for technical products (electronics, appliances, etc.)
- `metadataMapping` (object, optional): Fields to store as metadata
  - If not specified: Full dataset items are stored (preserves all data)
  - If specified: Only mapped fields are stored (compact, optimized)
  - Can include any fields (price, SKU, images, etc.)
for retrieval
- `maxItems` (integer, optional): Limit number of products (useful for testing)
- `batchSize` (integer, default: 1000): Products to encode per batch
  - Larger batches = faster but more memory
  - Range: 1-10,000
- `debugMode` (boolean, default: false): Enable verbose debug logging
  - When enabled: Shows detailed data structure information and extraction paths
  - When disabled (recommended): Cleaner production logs with better security
  - ⚠️ Warning: Debug mode may expose internal data structures in logs

## Dot Notation Mapping

### Basic Syntax

Extract nested fields using dot notation:

```json
{
  "title": "product.name.translated",
  "brand": "brand.name",
  "category": "product.details.taxonomy_type.group_name"
}
```

Given this dataset item:

```json
{
  "product": {
    "name": {"translated": "Canvas Tote Bag"},
    "details": {
      "taxonomy_type": {"group_name": "Bags & Totes"}
    }
  },
  "brand": {"name": "EcoBrand"}
}
```

Extracts:

```json
{
  "title": "Canvas Tote Bag",
  "brand": "EcoBrand",
  "category": "Bags & Totes"
}
```

### Special Syntax: {first}

Use `{first}` to select the first key from a dictionary:

```json
{
  "price": "product.options.{first}.retail_price_cents"
}
```

Given:

```json
{
  "product": {
    "options": {
      "opt_abc123": {"retail_price_cents": 2499},
      "opt_def456": {"retail_price_cents": 3499}
    }
  }
}
```

Extracts: `2499` (from first option)

### Null Values

Set a field to `null` to explicitly omit it:

```json
{
  "title": "product.name",
  "specifications": null
}
```

## Embedding Model

Uses a fine-tuned sentence transformer model optimized for product matching:

- Base model: `sentence-transformers/all-MiniLM-L6-v2`
- Fine-tuned: On product matching task
- Format: ONNX for fast CPU inference
- Embedding dimension: 384
- Normalization: L2-normalized (cosine similarity via inner product)

### Embedding Format

Products are formatted before encoding:

```
title: {title} | brand: {brand} | category: {category} | desc: {description} | spec: {specifications}
```

Important: Price is NOT included in embeddings (it's metadata only).
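
The mapping and formatting steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actor's actual `src/mapping.py`: the helper names `resolve_path` and `format_product` are hypothetical, and the handling of missing optional fields (here the segment is simply left out) is an assumption, since the README does not say how the actor treats them.

```python
from typing import Any, Dict, Optional


def resolve_path(item: Any, path: Optional[str]) -> Any:
    """Follow a dot-notation path; '{first}' selects the first entry of a dict."""
    if path is None:
        return None
    current = item
    for part in path.split("."):
        # Stop as soon as the path leaves the nested-dict structure.
        if not isinstance(current, dict) or not current:
            return None
        if part == "{first}":
            current = next(iter(current.values()))
        else:
            current = current.get(part)
    return current


def format_product(item: Dict[str, Any], mapping: Dict[str, Optional[str]]) -> str:
    """Build the text to embed, following the documented template."""
    # (label used in the template, key in the hypothetical mapping dict)
    labels = [
        ("title", "title"),
        ("brand", "brand"),
        ("category", "category"),
        ("desc", "description"),
        ("spec", "specifications"),
    ]
    segments = []
    for label, key in labels:
        value = resolve_path(item, mapping.get(key))
        if value is not None and value != "":
            segments.append(f"{label}: {value}")  # assumption: skip empty fields
    return " | ".join(segments)


item = {
    "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "details": {"taxonomy_type": {"group_name": "Bags & Totes"}},
        "options": {
            "opt_abc123": {"retail_price_cents": 2499},
            "opt_def456": {"retail_price_cents": 3499},
        },
    },
    "brand": {"name": "EcoBrand"},
}
mapping = {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
    "specifications": None,
}

print(format_product(item, mapping))
# title: Canvas Tote Bag | brand: EcoBrand | category: Bags & Totes
print(resolve_path(item, "product.options.{first}.retail_price_cents"))
# 2499
```

The price is resolved separately here because, as noted above, it belongs in metadata and never enters the embedded text.
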
## Output Format

The actor saves two files to the specified Key-Value Store:

### 1. index.faiss (Binary)

FAISS IndexFlatIP (Inner Product) containing normalized embeddings.

- Type: Inner Product index (cosine similarity for normalized vectors)
- Usage: Load with `faiss.deserialize_index(bytes)`

### 2. metadata.json (JSON)

Complete metadata about the index:

```json
{
  "version": "1.0",
  "created_at": "2025-10-30T19:00:00Z",
  "total_products": 104321,
  "embedding_dim": 384,
  "model": "product-matcher-onnx",
  "embedding_mapping": {
    "title": "product.name.translated",
    "brand": "brand.name"
  },
  "metadata_mapping": {
    "title": "product.name.translated",
    "price": "product.options.{first}.retail_price_cents"
  },
  "ids": ["p_123", "p_456", ...],
  "metadata": [
    {"title": "Canvas Tote", "price": 2499},
    {"title": "Water Bottle", "price": 1999},
    ...
  ]
}
```

Fields:

- `version`: Metadata schema version
- `created_at`: UTC timestamp
- `total_products`: Number of products in index
- `embedding_dim`: Vector dimension (384)
- `model`: Model identifier
- `embedding_mapping`: Mapping used for embeddings
- `metadata_mapping`: Mapping used for metadata (or null if full items)
- `ids`: Array of product IDs (same order as FAISS index)
- `metadata`: Array of product metadata (same order as FAISS index)

## Usage Examples

### Example 1: Minimal Configuration (Full Metadata)

Store full dataset items as metadata, required fields only:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "category.name",
  "kvStoreName": "products-full"
}
```

Result: metadata.json contains full dataset items (preserves all data).
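
A configuration like Example 1 can be sanity-checked locally before starting a run. The sketch below is a hypothetical helper (`check_input` is not part of the actor) that only encodes what the Input Parameters section states: six required string parameters and a `batchSize` between 1 and 10,000. The actor's own validation in `.actor/input_schema.json` may be stricter or differ in detail.

```python
from typing import Any, Dict, List

# Required parameters as listed under "Required Parameters".
REQUIRED_PARAMS = (
    "datasetId",
    "idField",
    "titlePath",
    "brandPath",
    "categoryPath",
    "kvStoreName",
)


def check_input(run_input: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_PARAMS:
        value = run_input.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing required parameter: {key}")

    # batchSize defaults to 1000 and must stay within the documented range.
    batch_size = run_input.get("batchSize", 1000)
    is_int = isinstance(batch_size, int) and not isinstance(batch_size, bool)
    if not is_int or not 1 <= batch_size <= 10_000:
        problems.append("batchSize must be an integer between 1 and 10,000")
    return problems


example_1 = {
    "datasetId": "abc123",
    "idField": "product.token",
    "titlePath": "product.name.translated",
    "brandPath": "brand.name",
    "categoryPath": "category.name",
    "kvStoreName": "products-full",
}

print(check_input(example_1))  # []
print(check_input({**example_1, "batchSize": 0}))
# ['batchSize must be an integer between 1 and 10,000']
```
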
### Example 2: Complete Configuration with Descriptions

Include all fields for best matching quality:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": "product.specifications",
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
    "price": "product.options.{first}.retail_price_cents",
    "image": "product.image_url"
  },
  "kvStoreName": "products-complete"
}
```

Result: Best embedding quality + compact metadata with price and image.

### Example 3: Products Without Specifications

For non-technical products (clothing, home goods, etc.) that don't have specifications:

```json
{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.category",
  "descriptionPath": "product.description",
  "specificationsPath": null,
  "kvStoreName": "fashion-products"
}
```

Note: `specificationsPath` set to null (or omitted) since fashion products typically don't have technical specs.
### Example 4: Testing with Limit

Test with 100 products (minimal setup):

```json
{
  "datasetId": "abc123",
  "idField": "id",
  "titlePath": "name",
  "brandPath": "brand",
  "categoryPath": "category",
  "kvStoreName": "test-index",
  "maxItems": 100,
  "batchSize": 50
}
```

## Using the Generated Index

### Python Example

```python
from apify_client import ApifyClient
import faiss
import numpy as np
import json

# Initialize Apify client
client = ApifyClient("YOUR_API_TOKEN")

# Get KV store
kv_store = client.key_value_store("customer-products-index")

# Load index (stored as raw bytes)
index_bytes = kv_store.get_record("index.faiss")["value"]
index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))

# Load metadata; ids and metadata are in the same order as the FAISS index
metadata = kv_store.get_record("metadata.json")["value"]
ids = metadata["ids"]
product_metadata = metadata["metadata"]

print(f"Loaded index with {index.ntotal} products")

# Search example (assuming you have a query embedding)
query_embedding = ...  # 384-dim vector, L2-normalized
k = 5  # Top 5 results

similarities, indices = index.search(
    query_embedding.reshape(1, -1).astype('float32'),
    k
)

# Get results
for rank, (sim, idx) in enumerate(zip(similarities[0], indices[0])):
    if idx < 0:
        continue  # FAISS pads with -1 when fewer than k results exist
    product_id = ids[idx]
    meta = product_metadata[idx]
    print(f"{rank+1}. {meta['title']} (similarity: {sim:.3f})")
```

## Migration Recovery

The actor automatically handles server migrations:

1. State Persistence: Progress is saved on PERSIST_STATE events
2. Batch Checkpoints: In-progress batches are saved before migration
3. Auto-Resume: On restart, actor resumes from last checkpoint
4. No Data Loss: All processed embeddings are preserved

State stored in default KV store:

- `vectorizer-state`: Progress tracking
- `vectorizer-batch-checkpoint`: In-progress batch data

These are automatically cleaned up on successful completion.
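
The checkpoint-and-resume idea behind the recovery steps above can be illustrated with a small, self-contained sketch. It is not the actor's code: a plain dict stands in for the default Key-Value Store, `vectorize_with_checkpoints` and the fake encoder are hypothetical, and the real actor reacts to PERSIST_STATE events rather than saving after every batch. Only the `vectorizer-state` key name is taken from the README.

```python
from typing import Callable, Dict, List

STATE_KEY = "vectorizer-state"  # key name from the README


def vectorize_with_checkpoints(
    items: List[str],
    batch_size: int,
    encode_batch: Callable[[List[str]], List[List[float]]],
    store: Dict[str, dict],
) -> List[List[float]]:
    """Encode items in batches, saving progress so a restart can resume."""
    state = store.get(STATE_KEY, {"processed": 0, "vectors": []})
    processed = state["processed"]
    vectors = list(state["vectors"])

    while processed < len(items):
        batch = items[processed:processed + batch_size]
        vectors.extend(encode_batch(batch))
        processed += len(batch)
        # Persist progress only after the batch has finished.
        store[STATE_KEY] = {"processed": processed, "vectors": list(vectors)}

    store.pop(STATE_KEY, None)  # clean up on successful completion
    return vectors


class Migration(Exception):
    """Stand-in for the run being interrupted by a server migration."""


def make_encoder(fail_on_call=None):
    calls = {"n": 0}

    def encode(batch):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise Migration
        # Fake one-dimensional "embedding": the text length.
        return [[float(len(text))] for text in batch]

    return encode


items = ["a", "bb", "ccc", "dddd", "eeeee"]
store: Dict[str, dict] = {}

try:
    # First run is interrupted while encoding the second batch.
    vectorize_with_checkpoints(items, 2, make_encoder(fail_on_call=2), store)
except Migration:
    pass

resumed_from = store[STATE_KEY]["processed"]
result = vectorize_with_checkpoints(items, 2, make_encoder(), store)

print(resumed_from)  # 2
print(result)        # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

The second call skips the two items that were already encoded and leaves no state behind once it finishes.
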
## Performance

The actor is optimized for efficient processing:

- Fast model loading and initialization
- Efficient batch vectorization
- Quick FAISS index building
- Memory usage scales with batch size

Optimization tips:

- Increase `batchSize` for faster processing (up to 10,000)
- Use selective `metadataMapping` to reduce memory usage
- For very large datasets (>1M products), consider chunking

## Files

### Core Actor Files

- `src/main.py` - Main actor entry point with migration recovery
- `src/vectorizer.py` - ONNX vectorizer wrapper
- `src/mapping.py` - Dot-notation field extraction
- `src/preprocessing.py` - Text preprocessing utilities

### Configuration

- `.actor/actor.json` - Actor metadata
- `.actor/input_schema.json` - Input parameter schema
- `Dockerfile` - Container definition
- `requirements.txt` - Python dependencies

### Model Files

- `models/product-matcher-onnx/` - ONNX model files
  - `model.onnx` - Optimized inference model
  - `tokenizer.json` - Tokenizer configuration
  - `pooling_config.json` - Pooling configuration

## Deployment

1. Configure input: Set dataset ID, mappings, and KV store name
2. Run actor: Via Apify Console or API
3. Monitor progress: Real-time status updates with ETA
4. Retrieve index: Access from specified KV store

Deploy to Apify:

```bash
apify push
```

## Troubleshooting

### Missing ID Field

Error: Missing or empty ID field at path: product.token

Solution: Check that `idField` path is correct and all items have IDs.

### Empty Dataset

Warning: No items found in dataset

Solution: Verify dataset ID and that it contains items.

### Invalid Mapping

Error: embeddingMapping cannot be empty

Solution: Provide at least one field in embeddingMapping.

### Memory Issues

Error: Out of memory during batch processing

Solution: Reduce `batchSize` (try 500 or 250).

## Related Actors

- Product Matcher: Uses this index to find similar products
- Product Scraper: Collects products for indexing

## Support

For issues or questions, please create an issue in the repository.

## License

MIT
Actor Information
- Developer: tri_angle
- Pricing: Paid
- Active Users: 1