PGVector Integration
by apify
Effortlessly transfer data from Apify actors to a Postgres database with PGVector for vector search. Perfect for building RAG apps and similarity search pipelines.
About PGVector Integration
Need to get your scraped web data into a Postgres database with vector search capabilities? This actor is for you. It is a straightforward integration that moves data directly from your Apify actors into a Postgres database that has the PGVector extension installed. Think of it as a bridge between your data collection pipelines and a searchable vector store. I use it when I've scraped product catalogs or document sets and need to run similarity searches or build a RAG application without a complicated setup. It handles the connection and data mapping, so you can focus on querying your data with SQL and vector operations. The main benefit is simplicity: you configure your dataset and database credentials, and it handles the transfer. It suits developers who want to combine Apify's scraping with the flexibility of Postgres for AI-driven data analysis, while keeping their stack open-source and under their control.
What does this actor do?
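PGVector Integration is an integration Actor available on the Apify platform. Rather than scraping pages itself, it takes the datasets produced by other Actors, computes embeddings for the text, and stores them in a PostgreSQL database with the PGVector extension, all running in the cloud.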
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and check the results in your PostgreSQL database
Documentation
PGVector Integration
Transfers data from Apify Actors to PostgreSQL with the PGVector extension. It processes datasets, optionally splits text into chunks, computes embeddings, and stores them for efficient search and retrieval, particularly useful for Retrieval Augmented Generation (RAG) applications.
Overview
This Actor is an integration tool designed to work with other Apify Actors. For example, you can connect it to the Website Content Crawler to automatically save crawled web content as vector embeddings in your PostgreSQL database. It uses LangChain for text processing and embedding computation.
Key Features
- Vector Storage: Computes text embeddings and stores them in a PostgreSQL database with the PGVector extension.
- Incremental Updates: Can be configured to update only changed data, reducing unnecessary compute and storage operations.
- Text Chunking: Optionally splits long text into smaller chunks using LangChain's RecursiveCharacterTextSplitter for better embedding quality.
- Embedding Provider Flexibility: Supports multiple providers like OpenAI and Cohere for generating embeddings.
- Dataset Field Mapping: Lets you specify which dataset fields to store as content and which to keep as metadata.
How to Use
This integration runs automatically when configured within another Actor's integration settings. You don't run it as a standalone task.
- Prerequisites:
  - A PostgreSQL database with the PGVector extension installed.
  - Your database connection string (postgresSqlConnectionStr) and a target collection/table name (postgresCollectionName).
  - An API key for your chosen embeddings provider (e.g., from OpenAI).
- Configuration: In your source Actor (e.g., Website Content Crawler), activate the PGVector integration and provide the required input. The main configuration sections are:
  - Database Connection: Your PostgreSQL credentials and target collection.
  - Embeddings Provider: Your chosen provider (e.g., OpenAIEmbeddings) and its API key/model settings.
  - Data Mapping: Which fields from the source dataset contain the text and metadata.
- Process Flow:
  - The integration fetches the dataset from the source Actor.
  - (Optional) It splits the text data into chunks based on your chunkSize and chunkOverlap settings.
  - (Optional) It identifies and processes only new or modified data if incremental updates are enabled.
  - It sends the text to the configured embeddings API to compute vector representations.
  - It saves the vectors, along with the original text and any metadata, to your PostgreSQL database.
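The process flow above can be sketched in plain Python. This is only an illustration under stated assumptions: `split_text` is a simplified fixed-width splitter with overlap, not LangChain's RecursiveCharacterTextSplitter, and `fake_embed` is a hypothetical placeholder for a real embeddings API call. The row layout (`text`, `embedding`, `metadata`) is likewise assumed for the example.

```python
from typing import Callable, Dict, List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Sliding-window splitter: each chunk shares chunk_overlap chars with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def fake_embed(chunk: str) -> List[float]:
    """Hypothetical stand-in for an embeddings API; returns a toy 2-dimensional vector."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def build_rows(
    items: List[Dict],
    text_field: str,
    embed: Callable[[str], List[float]],
    chunk_size: int,
    chunk_overlap: int,
) -> List[Dict]:
    """Chunk each dataset item, embed every chunk, and collect rows ready to be stored."""
    rows: List[Dict] = []
    for item in items:
        for chunk in split_text(item[text_field], chunk_size, chunk_overlap):
            rows.append(
                {
                    "text": chunk,
                    "embedding": embed(chunk),
                    "metadata": {"url": item.get("url")},
                }
            )
    return rows


if __name__ == "__main__":
    dataset = [{"text": "abcdefghij", "url": "https://example.com"}]
    for row in build_rows(dataset, "text", fake_embed, chunk_size=4, chunk_overlap=1):
        print(row["text"], row["embedding"])
```

In a real run, the embedding call would go to the configured provider and the rows would be written to PostgreSQL instead of printed.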
Input / Output
Input Schema
The integration is configured via input fields when setting it up on a source Actor. Full details are on the Input page.
Essential Configuration Example:
{
  "postgresSqlConnectionStr": "postgresql://user:password@host:5432/dbname",
  "postgresCollectionName": "my_docs",
  "embeddingsProvider": "OpenAIEmbeddings",
  "embeddingsApiKey": "your-openai-key",
  "embeddingsConfig": {"model": "text-embedding-3-small"},
  "datasetFields": ["text"],
  "metadataDatasetFields": {"url": "url", "title": "metadata.title"}
}
- postgresSqlConnectionStr: Your PostgreSQL connection string.
- postgresCollectionName: The table name where vectors will be stored.
- embeddingsProvider & embeddingsApiKey: Defines which service computes your embeddings.
- datasetFields: An array specifying which dataset field(s) contain the main text to embed (e.g., ["text"]).
- metadataDatasetFields: A mapping of metadata column names to their source fields in the dataset.
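To make the field mapping concrete, here is a minimal sketch of how a dotted path such as metadata.title might be resolved against a dataset item. The helper names `get_nested` and `map_item` are hypothetical, and joining multiple content fields with a newline is an assumption made for the example, not documented behavior of the Actor.

```python
from typing import Any, Dict, List, Tuple


def get_nested(item: Dict[str, Any], dotted_path: str) -> Any:
    """Follow a path like 'metadata.title' through nested dicts; None if any key is missing."""
    current: Any = item
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def map_item(
    item: Dict[str, Any],
    dataset_fields: List[str],
    metadata_fields: Dict[str, str],
) -> Tuple[str, Dict[str, Any]]:
    """Return (content, metadata) for one dataset item according to the two mappings."""
    values = (get_nested(item, field) for field in dataset_fields)
    # Assumption: several content fields are concatenated with newlines.
    content = "\n".join(str(value) for value in values if value is not None)
    metadata = {name: get_nested(item, path) for name, path in metadata_fields.items()}
    return content, metadata


if __name__ == "__main__":
    item = {
        "text": "Hello",
        "url": "https://example.com",
        "metadata": {"title": "Home"},
    }
    print(map_item(item, ["text"], {"url": "url", "title": "metadata.title"}))
```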
Optional Settings:
- performChunking: Set to true to enable text splitting.
- chunkSize & chunkOverlap: Control the size and overlap of text chunks.
- dataUpdatesStrategy: Choose how to handle updates (e.g., incremental).
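The text does not spell out how incremental updates decide what has changed. One common approach, shown here purely as an assumption, is to store a checksum of each item's content and compare it on the next run. The function names and the "insert" / "update" / "skip" labels are hypothetical.

```python
import hashlib
from typing import Dict, List


def checksum(text: str) -> str:
    """Stable fingerprint of the content; any change in the text changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(stored: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare stored checksums (key -> digest) with incoming content (key -> text).

    Only keys listed under 'insert' or 'update' would need new embeddings.
    """
    plan: Dict[str, List[str]] = {"insert": [], "update": [], "skip": []}
    for key, text in incoming.items():
        digest = checksum(text)
        if key not in stored:
            plan["insert"].append(key)
        elif stored[key] != digest:
            plan["update"].append(key)
        else:
            plan["skip"].append(key)
    return plan


if __name__ == "__main__":
    stored = {"a": checksum("one"), "b": checksum("two")}
    incoming = {"a": "one", "b": "TWO", "c": "three"}
    print(plan_updates(stored, incoming))
```

Skipping unchanged items is what saves embedding API calls and database writes, which is the benefit the Incremental Updates feature describes.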
Output
The Actor does not produce a separate dataset. Its output is the populated (or updated) PostgreSQL table, which contains columns for the vector embeddings, the source text, and the mapped metadata. Ensure your PostgreSQL vector column dimension matches the output size of your chosen embedding model (e.g., 1536 for text-embedding-3-small).
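A small sketch of the dimension check mentioned above, together with a similarity query you might run afterwards. The table name `my_docs` and the column names `content`, `metadata`, and `embedding` are hypothetical, since the actual schema depends on how the integration creates its tables. The `<=>` operator is pgvector's cosine distance operator, and `%s` stands for a driver-style query parameter.

```python
from typing import List, Sequence

# Dimension taken from the text above; extend with other models as needed.
KNOWN_DIMENSIONS = {"text-embedding-3-small": 1536}

# Hypothetical table and column names; adjust to the schema in your database.
SIMILARITY_QUERY = (
    "SELECT content, metadata "
    "FROM my_docs "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT 5"
)


def check_dimension(vector: Sequence[float], model: str) -> None:
    """Raise if the vector length does not match the model's known output size."""
    expected = KNOWN_DIMENSIONS[model]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions for {model}, got {len(vector)}")


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


if __name__ == "__main__":
    query_vector: List[float] = [0.0] * 1536  # placeholder for a real query embedding
    check_dimension(query_vector, "text-embedding-3-small")
    print(SIMILARITY_QUERY)
    print(to_pgvector_literal([1, 2.5]))
```

The literal produced by `to_pgvector_literal` would be passed as the query parameter through your PostgreSQL driver of choice.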
Actor Information
- Developer: apify
- Pricing: Paid
- Total Runs: 103
- Active Users: 16