News Blindspot

by spirosdouk


28 runs
3 users

About News Blindspot

Detects open-data media coverage blindspots — topics heavily covered by one political side but undercovered by others. Uses GDELT, AllSides, and MBFC data only.

What does this actor do?

News Blindspot is a data-analysis actor on the Apify platform. It cross-references news coverage from GDELT with AllSides and MBFC bias ratings to surface topics that one political side covers heavily while the others under-cover, and it runs entirely in the cloud with no local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Public data only: reads GDELT, AllSides, and MBFC, with no scraping or logins
  • Scheduled runs and webhooks for automation
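The "API access" above typically means you can start a run and read its dataset over Apify's REST API. A minimal sketch, assuming Node 18+ (global `fetch`); the actor ID below is a guess based on this page, so check the actor's API tab for the real one:

```javascript
// Hypothetical sketch of calling the actor through Apify's REST API.
// `run-sync-get-dataset-items` starts a run and returns its dataset items
// in one request; the actor ID and env-var name here are assumptions.

function buildRunUrl(actorId, token) {
  return `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items?token=${token}`;
}

async function runBlindspot(input, actorId = 'spirosdouk~news-blindspot') {
  const res = await fetch(buildRunUrl(actorId, process.env.APIFY_TOKEN), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input),
  });
  if (!res.ok) throw new Error(`Apify run failed: HTTP ${res.status}`);
  return res.json(); // one result object per query
}

// Example (requires APIFY_TOKEN to be set):
// runBlindspot({ queries: ['climate change'], sinceHours: 24 }).then(console.log);
```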

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
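For step 3, a minimal input might look like this (field names come from the input schema in the documentation below; omitted fields fall back to their defaults):

```json
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24
}
```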

Documentation

# Open-data Blindspot Detector

Detects media coverage blindspots by analyzing political bias distribution across news sources, using only public, legal data sources.

## Live demo

- Run: View on Apify
- Dataset: View JSON output

## What it does

Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:

- Fetches news articles from the GDELT DOC API for specified queries
- Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
- Clusters articles by story (title + day) to deduplicate rewrites
- Calculates known-only shares (Unknown excluded from the denominator) for each political side
- Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
- Emits representative URLs per side and an Unknown% confidence proxy

## Stack & data sources

Stack:

- Node.js 20+
- Apify Actor framework
- axios, p-limit, tldts, csv-parse
- Optional: @xenova/transformers (for the LLM weak labeler)

Data sources:

- GDELT DOC API — article metadata (title, URL, source, date)
- AllSides — brand→bias mapping (GitHub CSV)
- MBFC — domain→bias mapping (GitHub Gist)
- Local overrides — optional JSON file for manual corrections

No private or paid data; no background scraping.
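The first pipeline step above can be sketched as a GDELT DOC API request. This is an illustrative sketch, not the actor's actual code: the query operators follow GDELT's public documentation, while the function names and `maxrecords` value are assumptions.

```javascript
// Illustrative sketch of a GDELT DOC API request for one query.

function buildGdeltUrl(query, sinceHours = 24, restrictToEnglish = true) {
  const q = restrictToEnglish ? `${query} sourcelang:english` : query;
  const params = new URLSearchParams({
    query: q,
    mode: 'artlist',            // article list: one record per article
    format: 'json',
    maxrecords: '250',          // assumed cap per request
    timespan: `${sinceHours}h`, // lookback window, e.g. "24h"
  });
  return `https://api.gdeltproject.org/api/v2/doc/doc?${params}`;
}

async function fetchArticles(query, sinceHours = 24) {
  const res = await fetch(buildGdeltUrl(query, sinceHours)); // Node 18+ global fetch
  if (!res.ok) throw new Error(`GDELT HTTP ${res.status}`);
  const body = await res.json();
  return body.articles ?? []; // [{ title, url, domain, seendate, ... }]
}
```

Note how `restrictToEnglish` maps to GDELT's `sourcelang:english` operator, matching the "English-only filter" design choice described below.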
## Key design choices

- English-only filter — GDELT `sourcelang:english` to reduce non-English noise
- Cluster by story — title+day clustering; shares are computed on clusters, not raw articles
- Known-only math — `*_pct_known` excludes Unknown from the denominator; `excludeUnknownFromBlindspot` uses known-only shares for flagging
- Gap-aware flags — requires both `side_pct < blindspotThresholdPct` AND `gap_vs_next_pct >= gapMinPct`
- Label precedence — Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
- LLM weak labeler — optional, OFF by default; cached; conservative thresholds (confidence ≥ 0.8, margin ≥ 0.12)
- Provenance — label method captured for auditability (domain, bias, method)
- Confidence — `unknown_pct` reported; `confidence = 1 - unknown_pct/100`

## Install & run

Prerequisites:

- Node.js 20+
- Apify CLI (`npm install -g apify-cli@latest`) or an Apify platform account
- Note: this project uses the new `.actor/actor.json` format with a minimal `apify.json` for CLI compatibility

Local:

```bash
npm install
npx apify run
```

Apify platform:

1. Upload the actor to the Apify platform
2. Click "Run actor" with an input JSON (see below)

## Typical input

```json
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}
```

Important defaults:

- `restrictToEnglish: true` — filters to English sources
- `excludeUnknownFromBlindspot: true` — uses known-only shares for blindspot detection
- `blindspotThresholdPct: 20` — a side must be below 20% to flag
- `gapMinPct: 12` — the gap to the next side must be ≥ 12% to flag
- `enableLlmWeakLabeler: false` — recommended default (keeps runs lean)

## Input fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `queries` | string[] | `["climate change", "immigration"]` | Search queries for GDELT |
| `sinceHours` | integer | 24 | Lookback window (1–168 hours) |
| `blindspotThresholdPct` | number | 20 | Side must be below this % to flag (0–100) |
| `gapMinPct` | number | 12 | Minimum gap vs next side to flag (0–100) |
| `maxRepUrlsPerSide` | integer | 3 | Representative URLs per side (1–20) |
| `restrictToEnglish` | boolean | true | Filter to English sources via GDELT |
| `excludeUnknownFromBlindspot` | boolean | true | Use known-only shares for blindspot math |
| `overridesPath` | string | `"./bias-overrides.json"` | Path to overrides JSON: `{ "example.com": "left\|center\|right" }` |
| `enableLearning` | boolean | true | Enable conservative alias learning |
| `learningMinCount` | integer | 3 | Min samples to learn an alias (2–100) |
| `learningMinConsistency` | number | 0.8 | Min consistency to learn (0.7–1) |
| `suggestionsMax` | integer | 15 | Max suggested overrides per query (0–100) |
| `suggestionsMinCount` | integer | 2 | Min articles to suggest an override (1–50) |
| `enableLlmWeakLabeler` | boolean | false | Enable the LLM weak labeler (OFF by default) |
| `llmMaxDomains` | integer | 15 | Top unknown eTLD+1s to try (1–50) |
| `llmMinCount` | integer | 3 | Min articles per domain (2–50) |
| `llmConfidenceThreshold` | number | 0.8 | Min confidence (0.5–1) |
| `llmMarginThreshold` | number | 0.12 | Min margin winner–runnerUp (0–1) |
| `llmSampleTitlesPerDomain` | integer | 8 | Titles per domain to sample (1–20) |
| `forceRefreshCache` | boolean | false | Force refresh of the bias cache (ignores the 7-day TTL) |

## Output schema

Each query produces one result object:

```json
{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      { "eTLD1": "allafrica.com", "support": 3, "mode_side": null, "consistency": null }
    ],
    "suggested_overrides_snippet": "{\n  \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      { "domain": "bostonglobe.com", "bias": "left", "method": "domain:authoritative:host" }
    ]
  }
}
```

Field notes:

- `*_pct` — numbers (percent values to one decimal)
- `blindspot_flags` — zero or more flags; a flag appears when `side_pct < blindspotThresholdPct` and `gap_vs_next_pct >= gapMinPct`
- `confidence` — `1 - unknown_pct/100` (0.0–1.0)
- `representative_urls` — earliest URL per cluster, deduped by host, up to `maxRepUrlsPerSide` per side
- `suggested_overrides` — candidates for manual review (sorted by support count)

## Interpreting results

Blindspot flags:

- A flag means a side is both below the threshold (`blindspotThresholdPct`) and has a meaningful gap vs the next side (`gapMinPct`).
- Example: `{ "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }` means center has 0% coverage and is 13.3% below the next side.

Unknown% and confidence:

- High `unknown_pct` → low confidence → results are based on fewer known sources.
- If `unknown_pct` > 50%, consider widening `sinceHours` or adding overrides.

Clusters vs articles:

- `total_clusters` ≤ `total_articles_raw`, due to title+day deduplication (rewrites of the same story).

Representativeness caveats:

- GDELT is broad but not exhaustive.
- Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often unknown unless overridden.
- Representative URLs are the earliest per cluster, not necessarily the most authoritative.

## Overrides & human-in-the-loop

Maintain a small `bias-overrides.json` (≤ 100 entries) for frequent eTLD+1s:

```json
{
  "example.com": "center",
  "paper.co.uk": "right"
}
```

Workflow:

1. Review `unknown_summary.suggested_overrides` (sorted by support count).
2. Research each domain (editorial stance, ownership, fact-checking records).
3. Add it to `bias-overrides.json` with the chosen side (`left`, `center`, or `right`).
4. Keep the process auditable: PRs should state the evidence and the chosen side.

Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.

## LLM weak labeler (optional)

Default: OFF. When enabled:

- Uses Transformers.js zero-shot classification (tries a multilingual XNLI model, falls back to an English NLI model).
- Strict thresholds: confidence ≥ 0.8, margin ≥ 0.12.
- Caches results in `bias-cache.json` (`targeted_cache.llm`).
- Negative cache: skips domains for 14 days if results were low-confidence or non-English (with the English-only model).
- Not a replacement for overrides; use sparingly for the top unknown domains.

When to enable:

- Unknown% is high and manual overrides are impractical.
- Models are available (Transformers.js downloads them on first use).

## Operational notes

User-Agent:

- Update `USER_AGENT` in `main.js` with a real contact email: `"BlindspotDetectorBot/2.1 (contact: you@example.com)"`

Caching:

- `bias-cache.json` is refreshed at most every 7 days (unless `forceRefreshCache: true`).
- Includes `targeted_cache.llm` (LLM weak labels) and `llm_neg` (the 14-day skip list).
- The cache persists across runs.

Logs to expect:

- Fetch counts: `🔍 Fetching news from GDELT...`
- Clusters: `📦 Query "X": N articles → M clusters`
- Unknown%: the `confidence_note` field in the output
- Flags: the `blindspot_flags` array

Throughput tips:

- Limit the size of the `queries` array (fetches run in parallel, but GDELT rate limits apply).
- Widen `sinceHours` for sparse topics (more articles → better coverage).

## Limitations

- Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains go unresolved.
- Non-English and hyper-local outlets often remain unknown unless overridden.
- GDELT dedup isn't perfect: title+day clustering helps but won't catch every variation.
- Not a fact-checker: measures coverage, not truth or accuracy.
- English filter: `restrictToEnglish: true` excludes non-English sources (and may miss relevant coverage).

## Security & ethics

- Public sources only: GDELT DOC API, AllSides CSV, MBFC Gist (all public; the CSV and Gist are hosted on GitHub).
- No scraping: no paywalls, logins, or background scraping.
- Transparent heuristics: provenance is captured; overrides go through human review.
- Respectful rate limits: retries with backoff; the User-Agent identifies the bot.

## Roadmap

- Title-relevance gate (behind a flag) to filter off-topic articles.
- Fill `mode_side`/`consistency` in suggestions from tallies of known sources.
- Provenance source granularity (AllSides vs MBFC vs Overrides in the `method` field).
- Multilingual support when infrastructure allows (currently English-only filtering).

## Troubleshooting

High Unknown%:

- Widen `sinceHours` (more articles → more known sources).
- Add overrides for frequent unknown domains (see `unknown_summary.suggested_overrides`).

Empty representative URLs for a side:

- Story imbalance or sparse data; increase `sinceHours` or check query specificity.

LLM model load errors:

- Ensure models are available (Transformers.js downloads them on first use).
- Keep `enableLlmWeakLabeler: false` if models are unavailable.

Cache issues:

- Set `forceRefreshCache: true` to rebuild the bias map.
- Check that `bias-cache.json` exists and is valid JSON.

## Development

Repo structure:

```
blindspot-detector/
├── main.js               # Main actor logic
├── apify.json            # Minimal config for CLI compatibility
├── .actor/
│   └── actor.json        # Detailed actor configuration & input schema
├── package.json          # Dependencies
├── bias-cache.json       # Cached bias maps (auto-generated)
├── bias-overrides.json   # Manual overrides (optional)
└── input.json            # Example input
```

Configuration files:

- `apify.json` — minimal configuration for Apify CLI compatibility (legacy format support)
- `.actor/actor.json` — detailed actor configuration, including the input schema, metadata, and build settings

Scripts:

```bash
npm install     # Install dependencies
npx apify run   # Run locally
```

Code style:

- Node.js 20+, ES modules
- Async/await; p-limit for concurrency
- Conservative learning thresholds

## License

MIT
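As a closing illustration, the known-only shares and gap-aware flagging described under "Key design choices" can be sketched as follows. This is a re-implementation for illustration only, not the actor's code; the real flagging logic may apply additional conditions when choosing which sides to flag.

```javascript
// Known-only shares: Unknown clusters are excluded from the denominator.
function knownOnlyShares(counts) {
  // counts: cluster counts per side, e.g. { left: 13, center: 0, right: 2, unknown: 26 }
  const known = counts.left + counts.center + counts.right;
  const shares = {};
  for (const side of ['left', 'center', 'right']) {
    shares[side] = known ? +(100 * counts[side] / known).toFixed(1) : 0;
  }
  return { shares, known };
}

// Gap-aware flagging: a side is flagged when it is below the threshold AND
// the next-higher side is at least gapMinPct points above it.
function blindspotFlags(shares, thresholdPct = 20, gapMinPct = 12) {
  const ordered = Object.entries(shares).sort((a, b) => a[1] - b[1]);
  const flags = [];
  for (let i = 0; i < ordered.length - 1; i++) {
    const [side, pct] = ordered[i];
    const gap = +(ordered[i + 1][1] - pct).toFixed(1);
    if (pct < thresholdPct && gap >= gapMinPct) {
      flags.push({ side, pct, gap_vs_next_pct: gap });
    }
  }
  return flags;
}
```

With the counts behind the example output (`{ left: 13, center: 0, right: 2, unknown: 26 }`), `knownOnlyShares` yields 86.7 / 0.0 / 13.3 and `center` is flagged with a 13.3-point gap, matching the sample `blindspot_flags` entry.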


Ready to Get Started?

Try News Blindspot now on Apify. Free tier available with no credit card required.


Actor Information

Developer
spirosdouk
Pricing
Paid
Total Runs
28
Active Users
3
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

