Website Key Pages Finder

by yummy_gelato


About Website Key Pages Finder

Find key pages (pricing, docs, status, security, privacy, terms) on any website. Crawls start URLs and returns structured URLs with confidence scores and evidence. Great for competitor analysis, lead enrichment, and audits.

What does this actor do?

Website Key Pages Finder is a web scraping and automation tool that runs on the Apify platform. It crawls the start URLs you provide and returns structured URLs for six key page types (pricing, docs, status, security, privacy, terms), each with a confidence score and supporting evidence.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Automatically find and scrape key pages from any website: pricing pages, documentation, status pages, security information, privacy policies, and terms of service. This Apify Actor crawls websites intelligently and returns structured data with confidence scores for each discovered page.

## 🔍 What does Website Key Pages Finder do?

Website Key Pages Finder is an Apify Actor that automatically discovers important pages on any website. Given a list of URLs, it crawls each site and returns the URLs of six key page types along with confidence scores and evidence explaining how each page was found.

Key page types discovered:

- Pricing - Plans, costs, and billing information
- Documentation - API docs, guides, and developer resources
- Status - System uptime and incident pages
- Security - Trust centers, compliance, and security policies
- Privacy - Privacy policies and data protection information
- Terms - Terms of service and legal agreements

The Actor uses a multi-phase discovery approach that combines URL pattern probing, homepage link extraction, and intelligent crawling to find pages even on sites with non-standard structures.

## 🎯 Why scrape key pages from websites?

Finding key pages manually across dozens or hundreds of websites is time-consuming and error-prone.
This Actor automates the process, making it valuable for:

- Competitor Analysis - Quickly gather pricing pages and documentation from competitor websites to understand their offerings and positioning
- Sales Intelligence - Enrich lead data with links to company pricing, security, and compliance pages before outreach
- Website Auditing - Verify that your own sites have discoverable key pages and assess how competitors structure their information architecture
- Market Research - Collect pricing pages across an industry to analyze pricing trends and strategies
- Due Diligence - Gather legal documents (privacy policies, terms) from potential partners or acquisition targets
- Compliance Monitoring - Track privacy policy and terms changes across a portfolio of vendors

## 🚀 How to use Website Key Pages Finder

Follow these steps to find key pages on any website:

1. Open the Actor in Apify Console or via the API
2. Add your URLs to the Start URLs field (homepage URLs work best)
3. Configure options (optional) - adjust crawl depth, page limits, and timeout as needed
4. Run the Actor by clicking "Start" or calling the API
5. Download results from the Dataset tab in JSON, CSV, or Excel format

### Example Input

```json
{
  "startUrls": [
    { "url": "https://apify.com" },
    { "url": "https://stripe.com" },
    { "url": "https://github.com" }
  ],
  "maxDepth": 1,
  "maxPagesPerDomain": 12,
  "includeSubdomains": true
}
```

## 💰 How much does it cost to find key pages?

Website Key Pages Finder uses Pay Per Event (PPE) pricing, so you only pay for the websites you analyze.
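Under this model, a run's total is a flat start fee plus a per-site fee. A minimal sketch of the arithmetic, using the rates from the pricing list below (the helper function is illustrative, not part of the Actor):

```python
START_FEE = 0.005      # charged once per run (USD)
PER_SITE_FEE = 0.005   # charged per website analyzed (USD)

def estimate_cost(num_sites: int) -> float:
    """Estimate the total PPE cost of one run, in USD."""
    return round(START_FEE + PER_SITE_FEE * num_sites, 3)

print(estimate_cost(10))    # 0.055
print(estimate_cost(1000))  # 5.005
```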
Pricing:

- Per website analyzed: $0.005 per site
- Start fee: $0.005 per run
- No hidden compute costs - the price per website includes all crawling and processing

Cost control:

- Set a maximum spend per run in the Actor input to limit costs
- The Actor stops gracefully when your spending limit is reached
- Remaining URLs are skipped (not charged) when the limit is hit

Example costs:

| Websites | Cost |
|----------|--------|
| 10 | $0.055 |
| 100 | $0.505 |
| 1,000 | $5.005 |

Free tier: Apify provides a free tier with monthly credits, typically sufficient for testing and small-scale usage. Check your Apify account for current free tier limits.

## 📥 Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| startUrls | array | required | URLs of websites to analyze. Each URL should be a homepage or any page from the domain. |
| maxDepth | integer | 1 | Crawl depth. 0 = homepage only, 1 = homepage + priority pages (recommended). |
| maxPagesPerDomain | integer | 12 | Maximum pages to fetch per domain. Controls costs and processing time. |
| includeSubdomains | boolean | true | Whether to include subdomains when discovering pages (e.g., docs.example.com). |
| returnTopN | integer | 1 | Number of top candidates to return per page type. Set higher to see alternative candidates. |
| timeoutSecs | integer | 30 | Timeout in seconds for processing each site. |
| proxyConfiguration | object | { "useApifyProxy": false } | Proxy settings for sites that block direct access. |
| debug | boolean | false | Include debug information (raw candidates) in output. |

## 📤 Output

Each website produces one result object in the dataset:

```json
{
  "schemaVersion": "1.0.0",
  "inputUrl": "https://apify.com",
  "finalUrl": "https://apify.com/",
  "domain": "apify.com",
  "pages": {
    "pricing": {
      "url": "https://apify.com/pricing",
      "confidence": 0.95,
      "evidence": ["exact_path:/pricing", "anchor:Pricing", "footer_link"]
    },
    "docs": {
      "url": "https://docs.apify.com",
      "confidence": 0.92,
      "evidence": ["subdomain:docs", "anchor:Documentation"]
    },
    "status": {
      "url": "https://status.apify.com",
      "confidence": 0.88,
      "evidence": ["subdomain:status", "anchor:Status"]
    },
    "security": {
      "url": "https://apify.com/security",
      "confidence": 0.85,
      "evidence": ["path_token:security", "footer_link"]
    },
    "privacy": {
      "url": "https://apify.com/privacy-policy",
      "confidence": 0.90,
      "evidence": ["path_token:privacy", "anchor:Privacy Policy", "footer_link"]
    },
    "terms": {
      "url": "https://apify.com/terms-of-service",
      "confidence": 0.88,
      "evidence": ["path_token:terms", "anchor:Terms of Service", "footer_link"]
    }
  },
  "crawlStats": {
    "pagesFetched": 8,
    "timeMs": 2340,
    "errors": [],
    "likelyJsRendered": false
  },
  "timestamp": "2024-01-15T10:30:00.000Z"
}
```

### Output Fields

| Field | Description |
|-------|-------------|
| inputUrl | The URL you provided |
| finalUrl | The URL after following redirects |
| domain | The root domain extracted from the URL |
| pages | Object containing discovered pages for each type |
| pages.[type].url | URL of the discovered page |
| pages.[type].confidence | Confidence score from 0 to 1 |
| pages.[type].evidence | Array of signals that contributed to the score |
| crawlStats.pagesFetched | Number of pages fetched during discovery |
| crawlStats.timeMs | Processing time in milliseconds |
| crawlStats.errors | Any errors encountered during crawling |
| crawlStats.likelyJsRendered | Whether the site appears to be JavaScript-rendered |
| topCandidates | (Optional) When returnTopN > 1, contains all top candidates per type |

### Page Types

| Type | Common URL Patterns | Description |
|------|---------------------|-------------|
| pricing | /pricing, /plans, /price | Pricing and plan information |
| docs | /docs, /documentation, /api, /developer | Documentation and API reference |
| status | /status, /uptime, status.example.com | System status and uptime pages |
| security | /security, /trust, /compliance | Security and compliance information |
| privacy | /privacy, /privacy-policy, /data-protection | Privacy policy |
| terms | /terms, /tos, /terms-of-service, /legal | Terms of service |

## 📊 How does confidence scoring work?

Each discovered page includes a confidence score between 0 and 1 that indicates how certain the Actor is that the page is correct.

| Score | Meaning |
|-------|---------|
| 0.80 - 1.00 | Very confident - strong signals from URL path, anchor text, and page location |
| 0.50 - 0.79 | Probable match - good evidence but some ambiguity |
| 0.30 - 0.49 | Best guess - limited evidence, may need manual verification |
| Below 0.30 | Not returned - insufficient confidence |

### Scoring Factors

1. Discovery Source - base score from how the page was found
   - Fast-path (direct URL probe): +0.40
   - Homepage link: +0.30
   - Depth-1 crawl: +0.20
   - Sitemap: +0.10
2. Positive Signals - added to the score
   - Exact path match (e.g., /pricing): +0.30
   - Token in path (e.g., /pricing-plans): +0.20
   - Anchor text match: +0.25
   - Footer/nav location: +0.12 to +0.15
   - Subdomain match (e.g., docs.example.com): +0.25
3. Verification - final adjustment after checking page content
   - Title matches expected keywords: +0.20
   - Content verified: +0.15
   - HTTP error: -0.50
   - Wrong content type: -0.30

## 🔗 Integrations and API access

### REST API

Run the Actor via the Apify API:

```bash
curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~website-key-pages-finder/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://example.com" }],
    "maxDepth": 1,
    "maxPagesPerDomain": 12
  }'
```

### JavaScript SDK

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_USERNAME/website-key-pages-finder').call({
    startUrls: [
        { url: 'https://example.com' },
        { url: 'https://another-site.com' },
    ],
    maxDepth: 1,
    maxPagesPerDomain: 12,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

### Python SDK

```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('YOUR_USERNAME/website-key-pages-finder').call(run_input={
    'startUrls': [
        {'url': 'https://example.com'},
        {'url': 'https://another-site.com'},
    ],
    'maxDepth': 1,
    'maxPagesPerDomain': 12,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

### Webhooks and Integrations

Apify supports webhooks to notify your systems when a run completes. You can also integrate with:

- Zapier - Trigger workflows when new data is available
- Make (Integromat) - Build automated pipelines
- Google Sheets - Export results directly to spreadsheets
- Slack - Get notifications when runs complete

See Apify Integrations for more options.

## ❓ FAQ

### What happens if a page type isn't found?

If the Actor cannot find a page type with sufficient confidence (score >= 0.30), that type is omitted from the pages object in the output. This is normal - not all websites have all page types.
### Why are some confidence scores lower than expected?

Confidence scores depend on the signals found during crawling. Sites with non-standard URL structures, unusual navigation, or pages behind authentication may have lower scores. Check the evidence array to understand which signals were detected.

### Can this Actor handle JavaScript-rendered websites?

This Actor uses HTTP-based crawling (CheerioCrawler) for speed and efficiency. Sites that rely heavily on JavaScript for rendering may have incomplete results. The output includes a likelyJsRendered flag to indicate when this might be an issue. For such sites, consider using a browser-based scraper.

### How do I increase accuracy for specific sites?

- Increase maxPagesPerDomain to allow more thorough crawling
- Set returnTopN > 1 to see alternative candidates
- Enable debug mode to see all candidates and their scores

### What's the difference between maxDepth 0 and 1?

- maxDepth: 0 - Only analyzes the homepage (fastest, cheapest)
- maxDepth: 1 - Analyzes the homepage plus follows promising links (recommended for best results)

### Does this work with sites behind login?

No, this Actor only crawls publicly accessible pages. It cannot handle authentication or login flows.

## ⚖️ Is it legal to scrape websites for key pages?

Web scraping legality varies by jurisdiction and use case. When using this Actor:

- Respect robots.txt - The Actor follows standard web crawling conventions
- Review Terms of Service - Some websites explicitly prohibit scraping in their ToS
- Use reasonable rate limits - The Actor includes delays to avoid overwhelming servers
- Public data only - Only scrape publicly accessible information
- Intended use - Ensure your use case complies with applicable laws (GDPR, CCPA, etc.)

This Actor is designed for legitimate business purposes such as competitive research, lead enrichment, and website auditing. Users are responsible for ensuring their use complies with applicable laws and website terms of service.
Disclaimer: This information is not legal advice. Consult a legal professional for guidance specific to your jurisdiction and use case.

## ⚠️ Limitations

- JavaScript-rendered content - Uses HTTP-based crawling (CheerioCrawler), so heavily JavaScript-rendered sites may have incomplete results. Check the likelyJsRendered flag.
- Rate limiting - Some sites may block rapid requests. The Actor includes retry logic, but sites with aggressive anti-bot measures may cause failures.
- Page budget - Limited to maxPagesPerDomain fetches per site to control costs. Increase this for complex sites.
- Crawl depth - Currently supports depth 0 (homepage only) or depth 1 (homepage + one level). Deep recursive crawling is not supported.
- Authentication - Cannot access pages behind login or authentication.

## 🔄 Related Actors

Looking for more web scraping solutions? Check out these related Actors:

- Website Content Crawler - Extract all text content from websites
- Web Scraper - General-purpose web scraping with custom selectors
- Cheerio Scraper - Fast HTTP-based scraping for static sites

## 📚 Resources and support

- Apify Platform Documentation - Learn how to use Apify
- Report Issues - Found a bug? Let us know
- Apify Discord - Join the community for help and discussions

## 🛠️ Local Development

### Prerequisites

- Node.js 18+
- npm

### Setup

```bash
# Install dependencies
npm install

# Run locally
apify run

# Run with custom input
apify run --input='{"startUrls":[{"url":"https://example.com"}]}'
```

### Deploy

```bash
# Log in to Apify
apify login

# Deploy to the Apify platform
apify push
```
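Once a run finishes, dataset items follow the Output schema shown earlier, and page types below the confidence threshold are simply absent. A small post-processing sketch (the helper name and threshold are illustrative) that collects one page type across results:

```python
def collect_pages(items, page_type, min_confidence=0.5):
    """Return (domain, url, confidence) for each result that found the
    given page type at or above min_confidence. Uses .get() because a
    page type is omitted entirely when it was not found."""
    found = []
    for item in items:
        page = item.get("pages", {}).get(page_type)
        if page and page["confidence"] >= min_confidence:
            found.append((item["domain"], page["url"], page["confidence"]))
    return found

# Example with one dataset item shaped like the Output section:
items = [{
    "domain": "apify.com",
    "pages": {"pricing": {"url": "https://apify.com/pricing", "confidence": 0.95}},
}]
print(collect_pages(items, "pricing"))
# [('apify.com', 'https://apify.com/pricing', 0.95)]
```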

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Enrichment

Add pricing, security, and compliance page links to lead records before outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Website Key Pages Finder now on Apify. Free tier available with no credit card required.


Actor Information

Developer
yummy_gelato
Pricing
Paid
Total Runs
1
Active Users
1
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
