SEO Keyword Extractor
by wisteria_banjo
About SEO Keyword Extractor
Finds keyword phrases from a list of websites 🌐, groups similar ones into clear themes 🧩, and ranks them. Also suggests good main keywords ⭐ and possible negative keywords 🚫 so you can plan SEO and ad campaigns in a smarter, more focused way 📈.
What does this actor do?
SEO Keyword Extractor is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# 🔍 SEO Keyword Theme & Negative Keyword Analyzer 🚀

## 📘 Overview

This actor takes one or more URLs, extracts high-value multi-word SEO keyphrases, and then:

- Clusters common cross-site keyword families (semantic variants across multiple domains).
- Computes n-gram stats (e.g. “real estate lawyer”, “fort lauderdale real estate lawyer”) only for phrases that show up on multiple sites.
- Builds keyword themes (ranked topics with all their variants and sites).
- Suggests candidate negative keywords (likely competitor names / one-off phrases that only appear on a single site).

It’s built for serious competitive research, PPC planning, and semantic SEO clustering across your niche 🌐✨

## 🌟 Use Cases

| 💼 Scenario | 📈 Benefit |
|------------|------------|
| 🔎 Competitor keyword intelligence | See which phrases multiple competitors converge on (strong themes) vs. one-off phrases (weak or brand-specific). |
| 🧩 Local + practice-area SEO | Quickly surface geo + service combos like “fort lauderdale real estate lawyer” or “west palm beach probate attorney.” |
| 🧠 Semantic clustering & topic planning | Get “keyword themes” with a primary phrase, all variants, and which sites use them. |
| 🎯 PPC campaign & ad group design | Use themes as ad groups and variants as match types; use single-site phrases as negative keyword candidates. |
| 🧹 Keyword cleanup & noise reduction | Filters out junky code-like phrases, numeric strings, and odd technical terms by default. |

## 🧪 Output Structure

Results are written as flat dataset rows so they’re easy to export to CSV, Sheets, or BI tools. Each row has a `record_type` that tells you what kind of entity it is.
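Because every row carries a `record_type`, an exported dataset can be split by entity kind before further processing. The snippet below is a minimal sketch of that idea, not part of the actor itself: `split_by_record_type` is a hypothetical helper, and the sample rows are made up to mirror the record shapes documented in the sections that follow.

```python
from collections import defaultdict


def split_by_record_type(rows):
    """Group flat dataset rows by their record_type field (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        # Rows without a record_type are collected under "unknown".
        grouped[row.get("record_type", "unknown")].append(row)
    return dict(grouped)


# Made-up sample rows shaped like the examples in this README.
rows = [
    {
        "record_type": "page_keywords",
        "page_url": "https://example.com",
        "top_keywords": ["florida real estate lawyers"],
    },
    {
        "record_type": "keyword_theme",
        "primary_keyword": "florida real estate attorney",
        "score": 0.95,
    },
    {"record_type": "ngram_3", "ngram": "fort lauderdale real", "n": 3},
]

grouped = split_by_record_type(rows)
themes = grouped.get("keyword_theme", [])
print([theme["primary_keyword"] for theme in themes])
```

The same grouping works on rows loaded from a CSV or JSON export, as long as the `record_type` column is preserved.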
### 1️⃣ Per-page keywords

One row per URL:

```json
{
  "record_type": "page_keywords",
  "page_url": "https://example.com",
  "top_keywords": [
    "west palm beach real estate attorney",
    "florida real estate lawyers",
    "business litigation fort lauderdale"
  ]
}
```

### 2️⃣ Common cross-site keyword families

Clusters of similar phrases that show up on more than one site, with similarity metrics:

```json
{
  "record_type": "common_cross_site_keywords",
  "group_representative": "florida real estate attorney",
  "group_keywords": [
    "florida real estate attorney",
    "florida real estate lawyers",
    "florida real estate law",
    "law florida real estate",
    "real estate litigation attorneys"
  ],
  "keyword_count": 5,
  "site_count": 3,
  "sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com"
  ],
  "levenshtein_avg_distance": 0.31,
  "levenshtein_max_distance": 0.53
}
```

Use these rows to see:

- Which concepts recur across domains (`site_count`).
- How tight the wording cluster is (lower Levenshtein distances = more similar).

### 3️⃣ N-gram stats (cross-site phrases)

For each n (2, 3, …), the actor aggregates n-grams that appear on **at least 3 different sites** (strong cross-site themes):

```json
{
  "record_type": "ngram_3",
  "ngram": "fort lauderdale real",
  "n": 3,
  "count": 8,
  "site_count": 4,
  "sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com",
    "https://d.com"
  ],
  "sample_keywords": [
    "fort lauderdale real estate",
    "lauderdale real estate lawyer",
    "lauderdale real estate attorneys"
  ]
}
```

This is great for spotting **standard phrases** in the market (“real estate lawyer”, “west palm beach”, etc.).
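The cross-site filter described above can be sketched in a few lines of Python. This is an illustration only, not the actor's code: it assumes n-grams are taken from each site's extracted keyphrases using simple whitespace tokenization, and `cross_site_ngrams`, its `min_sites` parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict


def cross_site_ngrams(keywords_by_site, n, min_sites=3):
    """Count word n-grams and keep those seen on at least min_sites sites."""
    counts = defaultdict(int)  # total occurrences of each n-gram
    sites = defaultdict(set)   # distinct sites each n-gram appears on
    for site, keywords in keywords_by_site.items():
        for keyword in keywords:
            tokens = keyword.lower().split()
            # Slide a window of n tokens over the keyphrase.
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                counts[gram] += 1
                sites[gram].add(site)
    return [
        {
            "record_type": f"ngram_{n}",
            "ngram": gram,
            "n": n,
            "count": counts[gram],
            "site_count": len(seen_on),
            "sites": sorted(seen_on),
        }
        for gram, seen_on in sites.items()
        if len(seen_on) >= min_sites
    ]


# Made-up keyphrases for four hypothetical sites.
keywords_by_site = {
    "https://a.com": ["fort lauderdale real estate lawyer"],
    "https://b.com": ["real estate lawyer florida"],
    "https://c.com": ["best real estate lawyer"],
    "https://d.com": ["probate attorney"],
}

result = cross_site_ngrams(keywords_by_site, n=3)
print(result)
```

With this sample input, only “real estate lawyer” survives the filter, because it is the only trigram present on three different sites.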
### 4️⃣ Group-to-group similarity (Jaccard)

When two cross-site keyword families heavily overlap in their token sets, they’re connected with a Jaccard score:

```json
{
  "record_type": "group_similarity",
  "group_a": "florida real estate attorney",
  "group_b": "real estate lawyer",
  "similarity": 0.63
}
```

These tell you which keyword families are basically talking about the same thing and should probably be treated as one theme in your planning.

### 5️⃣ Keyword themes (the “use this in campaigns” layer)

Themes merge similar groups into higher-level topics and rank them:

```json
{
  "record_type": "keyword_theme",
  "primary_keyword": "florida real estate attorney",
  "score": 0.95,
  "site_count": 3,
  "groups_in_theme": 2,
  "all_variants": [
    "florida real estate attorney",
    "florida real estate law",
    "florida real estate lawyers",
    "law florida real estate",
    "real estate litigation attorneys"
  ],
  "all_sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com"
  ]
}
```

**How to use these:**

- Treat each `keyword_theme` as:
  - A **core SEO topic / pillar page**, or
  - A **PPC ad group** (primary = ad group name, variants = match types / ad copy phrases).

Higher `score` = stronger candidate.

### 6️⃣ Candidate negative keywords

The actor also flags n-grams that only appear on **one site** as **negative keyword candidates** (often brand names or very specific, non-generic terms):

```json
{
  "record_type": "negative_keyword_candidate",
  "phrase": "ryan shipp",
  "n": 2,
  "count": 3,
  "site_count": 1,
  "sites": [
    "https://competitor.com"
  ],
  "reason": "single_site_ngram"
}
```

These are **not auto-applied negatives**. They’re **suggestions** that you should manually review before adding to a PPC negative list (especially competitor names or hyper-specific phrases you don’t want to pay for).

## ⚙️ Input

### Required fields

```json
{
  "urls": [
    { "url": "https://example.com" },
    { "url": "https://another-site.com" }
  ],
  "min_ngram_n": 2
}
```

- **urls** (array)
  - Uses the `requestListSources` editor in Apify.
  - Accepts either `{ "url": "..." }` objects or plain strings `"https://..."`.
- **min_ngram_n** (integer, optional, default `2`)
  - The **minimum n-gram length** to analyze.
  - `2` = start at bigrams (“real estate”), `3` = only 3+ word phrases (“real estate lawyer”, “fort lauderdale real estate”).
  - Unigrams (single words) are **never** computed to keep noise down.

Internally, the actor analyzes n-grams from `min_ngram_n` up to a safe cap (currently `6`) to avoid combinatorial blow-ups on very long phrases.

## 🔄 How it works (under the hood)

1. **Fetch & clean**
   - Fetches each URL via HTTP.
   - Strips scripts, styles, and other noise and extracts visible text.
2. **Keyword extraction**
   - Uses a transformer-based model (`all-MiniLM-L6-v2` via KeyBERT) to extract multi-word keyphrases from the page content.
   - Filters out:
     - Numeric strings
     - Code-y / technical junk
     - Blacklisted tokens (e.g., obvious non-SEO boilerplate)
   - Keeps the most relevant 2–4 word keyphrases per page.
3. **Cross-site aggregation**
   - Clusters similar phrases across sites using RapidFuzz (token-set similarity).
   - Keeps only clusters seen on **multiple domains**.
   - Computes Levenshtein distances inside each cluster to quantify how tight/loose the variants are.
4. **N-gram analysis**
   - Builds n-gram stats across pages:
     - Only n in `[min_ngram_n, 6]`.
     - Only n-grams seen on **≥ 3 sites** are kept as strong cross-site themes.
5. **Theme building**
   - Builds a graph of keyword groups connected by high Jaccard similarity.
   - Collapses connected components into **themes**.
   - Scores each theme by:
     - Cross-site importance (how many sites use it).
     - Cohesion (Levenshtein-based).
     - Phrase length (favoring 2–4 word phrases).
6. **Negative keyword suggestions**
   - Separately scans all phrases for n-grams that appear on **exactly one site**.
   - Emits them as `negative_keyword_candidate` rows for manual review.
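The theme-building step (step 5) can be illustrated with a minimal sketch: link keyword groups whose token sets have a high Jaccard similarity, then collapse connected components into themes. This is an assumption-laden illustration, not the actor's implementation. It compares only each group's representative phrase, the `0.5` threshold is arbitrary, and `jaccard` and `build_themes` are hypothetical names. Scoring is left out.

```python
def jaccard(phrase_a, phrase_b):
    """Jaccard similarity between the word sets of two phrases."""
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_themes(groups, threshold=0.5):
    """Merge groups linked by Jaccard >= threshold into connected components."""
    parent = list(range(len(groups)))  # union-find forest over group indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if jaccard(groups[i], groups[j]) >= threshold:
                parent[find(j)] = find(i)  # add an edge: merge the components

    themes = {}
    for i, group in enumerate(groups):
        themes.setdefault(find(i), []).append(group)
    return list(themes.values())


# Made-up group representatives.
groups = [
    "florida real estate attorney",
    "real estate attorney",
    "probate attorney west palm beach",
]

themes = build_themes(groups)
print(themes)
```

Here the first two phrases share three of four distinct tokens (Jaccard 0.75) and end up in one theme, while the probate phrase shares only “attorney” with the others and stays on its own.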
## 💰 Monetization & Scaling

This actor is designed to work cleanly with **Apify Pay-Per-Event (PPE)**:

- **One event per run** – `apify-actor-start`: charge per actor start (each run).
- **One event per result row** – `apify-default-dataset-item`: every `Actor.push_data(...)` call creates a dataset item, which can be billed as a per-item event.

That means:

- Small runs with a few URLs → a handful of items → lower cost.
- Large competitive sweeps (many domains) → more items (pages, cross-site keywords, themes, negatives) → higher cost but also richer insight.

You can control cost by:

- Limiting the number of input URLs.
- Truncating or filtering which record types you care about (e.g., only `page_keywords` + `keyword_theme`).

## 🔄 Workflow Examples

This actor is **workflow-ready** and plays nicely with other Apify tools:

| 🔗 Integration | 🔍 Description |
|----------------|----------------|
| `serp-scraper` | Scrape top-ranking Google results for a query, then feed the URLs here to see the shared themes across the SERP. |
| `map-scraper` | Collect local business websites from Google Maps, then compare cross-site phrasing for local SEO campaigns. |
| Other actors | Build end-to-end automations: harvest → extract → cluster → export to Sheets/Data Studio. |

## 🚀 Ready to Launch?

Use this actor when you want more than just a list of keywords:

- See which phrases truly define your niche (themes & n-grams).
- Separate generic market language from brand-specific noise.
- Build better SEO topics, tighter PPC ad groups, and smarter negative lists.

Perfect for:

- SEO agencies
- Performance marketers
- Local law firms & service businesses
- Content strategists and SERP analysts

Happy crawling & clustering! 🚀🌐
Categories
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try SEO Keyword Extractor now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: wisteria_banjo
- Pricing: Paid
- Total Runs: 20
- Active Users: 4
Related Actors
Google Search Results Scraper
by apify
Google Search Results (SERP) Scraper
by scraperlink
Google Search
by devisty
Bing Search Scraper
by tri_angle
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.