News Blindspot

by spirosdouk


28 runs
3 users

About News Blindspot

Detects open-data media coverage blindspots — topics heavily covered by one political side but undercovered by others. Uses GDELT, AllSides, and MBFC data only.

What does this actor do?

News Blindspot is a data-analysis actor on the Apify platform. It cross-references news coverage from GDELT with AllSides and MBFC bias ratings to surface topics that one political side covers heavily while the others under-cover, and it runs entirely in the cloud with no local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Public data only: reads GDELT, AllSides, and MBFC, with no scraping or logins
  • Scheduled runs and webhooks for automation
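The "API access" above typically means you can start a run and read its dataset over Apify's REST API. A minimal sketch, assuming Node 18+ (global `fetch`); the actor ID below is a guess based on this page, so check the actor's API tab for the real one:

```javascript
// Hypothetical sketch of calling the actor through Apify's REST API.
// `run-sync-get-dataset-items` starts a run and returns its dataset items
// in one request; the actor ID and env-var name here are assumptions.

function buildRunUrl(actorId, token) {
  return `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items?token=${token}`;
}

async function runBlindspot(input, actorId = 'spirosdouk~news-blindspot') {
  const res = await fetch(buildRunUrl(actorId, process.env.APIFY_TOKEN), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input),
  });
  if (!res.ok) throw new Error(`Apify run failed: HTTP ${res.status}`);
  return res.json(); // one result object per query
}

// Example (requires APIFY_TOKEN to be set):
// runBlindspot({ queries: ['climate change'], sinceHours: 24 }).then(console.log);
```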

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
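For step 3, a minimal input might look like this (field names come from the input schema in the documentation below; omitted fields fall back to their defaults):

```json
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24
}
```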

Documentation

# Open-data Blindspot Detector

Detects media coverage blindspots by analyzing political bias distribution across news sources, using only public, legal data sources.

## Live demo

- Run: View on Apify
- Dataset: View JSON output

## What it does

Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:

- Fetches news articles from the GDELT DOC API for specified queries
- Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
- Clusters articles by story (title + day) to deduplicate rewrites
- Calculates known-only shares (Unknown excluded from the denominator) for each political side
- Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
- Emits representative URLs per side and an Unknown% confidence proxy

## Stack & data sources

Stack:

- Node.js 20+
- Apify Actor framework
- axios, p-limit, tldts, csv-parse
- Optional: @xenova/transformers (for the LLM weak labeler)

Data sources:

- GDELT DOC API — article metadata (title, URL, source, date)
- AllSides — brand→bias mapping (GitHub CSV)
- MBFC — domain→bias mapping (GitHub Gist)
- Local overrides — optional JSON file for manual corrections

No private or paid data; no background scraping.
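The first pipeline step above can be sketched as a GDELT DOC API request. This is an illustrative sketch, not the actor's actual code: the query operators follow GDELT's public documentation, while the function names and `maxrecords` value are assumptions.

```javascript
// Illustrative sketch of a GDELT DOC API request for one query.

function buildGdeltUrl(query, sinceHours = 24, restrictToEnglish = true) {
  const q = restrictToEnglish ? `${query} sourcelang:english` : query;
  const params = new URLSearchParams({
    query: q,
    mode: 'artlist',            // article list: one record per article
    format: 'json',
    maxrecords: '250',          // assumed cap per request
    timespan: `${sinceHours}h`, // lookback window, e.g. "24h"
  });
  return `https://api.gdeltproject.org/api/v2/doc/doc?${params}`;
}

async function fetchArticles(query, sinceHours = 24) {
  const res = await fetch(buildGdeltUrl(query, sinceHours)); // Node 18+ global fetch
  if (!res.ok) throw new Error(`GDELT HTTP ${res.status}`);
  const body = await res.json();
  return body.articles ?? []; // [{ title, url, domain, seendate, ... }]
}
```

Note how `restrictToEnglish` maps to GDELT's `sourcelang:english` operator, matching the "English-only filter" design choice described below.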
## Key design choices

- English-only filter — GDELT `sourcelang:english` to reduce non-English noise
- Cluster by story — title+day clustering; shares are computed on clusters, not raw articles
- Known-only math — `*_pct_known` excludes Unknown from the denominator; `excludeUnknownFromBlindspot` uses known-only shares for flagging
- Gap-aware flags — requires both `side_pct < blindspotThresholdPct` AND `gap_vs_next_pct >= gapMinPct`
- Label precedence — Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
- LLM weak labeler — optional, OFF by default; cached; conservative thresholds (confidence ≥ 0.8, margin ≥ 0.12)
- Provenance — label method captured for auditability (domain, bias, method)
- Confidence — `unknown_pct` reported; `confidence = 1 - unknown_pct/100`

## Install & run

Prerequisites:

- Node.js 20+
- Apify CLI (`npm install -g apify-cli@latest`) or an Apify platform account
- Note: this project uses the new `.actor/actor.json` format with a minimal `apify.json` for CLI compatibility

Local:

```bash
npm install
npx apify run
```

Apify platform:

1. Upload the actor to the Apify platform
2. Click "Run actor" with an input JSON (see below)

## Typical input

```json
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}
```

Important defaults:

- `restrictToEnglish: true` — filters to English sources
- `excludeUnknownFromBlindspot: true` — uses known-only shares for blindspot detection
- `blindspotThresholdPct: 20` — a side must be below 20% to flag
- `gapMinPct: 12` — the gap to the next side must be ≥ 12% to flag
- `enableLlmWeakLabeler: false` — recommended default (keeps runs lean)

## Input fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `queries` | string[] | `["climate change", "immigration"]` | Search queries for GDELT |
| `sinceHours` | integer | 24 | Lookback window (1–168 hours) |
| `blindspotThresholdPct` | number | 20 | Side must be below this % to flag (0–100) |
| `gapMinPct` | number | 12 | Minimum gap vs next side to flag (0–100) |
| `maxRepUrlsPerSide` | integer | 3 | Representative URLs per side (1–20) |
| `restrictToEnglish` | boolean | true | Filter to English sources via GDELT |
| `excludeUnknownFromBlindspot` | boolean | true | Use known-only shares for blindspot math |
| `overridesPath` | string | `"./bias-overrides.json"` | Path to overrides JSON: `{ "example.com": "left\|center\|right" }` |
| `enableLearning` | boolean | true | Enable conservative alias learning |
| `learningMinCount` | integer | 3 | Min samples to learn an alias (2–100) |
| `learningMinConsistency` | number | 0.8 | Min consistency to learn (0.7–1) |
| `suggestionsMax` | integer | 15 | Max suggested overrides per query (0–100) |
| `suggestionsMinCount` | integer | 2 | Min articles to suggest an override (1–50) |
| `enableLlmWeakLabeler` | boolean | false | Enable the LLM weak labeler (OFF by default) |
| `llmMaxDomains` | integer | 15 | Top unknown eTLD+1s to try (1–50) |
| `llmMinCount` | integer | 3 | Min articles per domain (2–50) |
| `llmConfidenceThreshold` | number | 0.8 | Min confidence (0.5–1) |
| `llmMarginThreshold` | number | 0.12 | Min margin winner–runnerUp (0–1) |
| `llmSampleTitlesPerDomain` | integer | 8 | Titles per domain to sample (1–20) |
| `forceRefreshCache` | boolean | false | Force refresh of the bias cache (ignores the 7-day TTL) |

## Output schema

Each query produces one result object:

```json
{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      { "eTLD1": "allafrica.com", "support": 3, "mode_side": null, "consistency": null }
    ],
    "suggested_overrides_snippet": "{\n  \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      { "domain": "bostonglobe.com", "bias": "left", "method": "domain:authoritative:host" }
    ]
  }
}
```

Field notes:

- `*_pct` — numbers (percent values to one decimal)
- `blindspot_flags` — zero or more flags; a flag appears when `side_pct < blindspotThresholdPct` and `gap_vs_next_pct >= gapMinPct`
- `confidence` — `1 - unknown_pct/100` (0.0–1.0)
- `representative_urls` — earliest URL per cluster, deduped by host, up to `maxRepUrlsPerSide` per side
- `suggested_overrides` — candidates for manual review (sorted by support count)

## Interpreting results

Blindspot flags:

- A flag means a side is both below the threshold (`blindspotThresholdPct`) and has a meaningful gap vs the next side (`gapMinPct`).
- Example: `{ "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }` means center has 0% coverage and is 13.3% below the next side.

Unknown% and confidence:

- High `unknown_pct` → low confidence → results are based on fewer known sources.
- If `unknown_pct` > 50%, consider widening `sinceHours` or adding overrides.

Clusters vs articles:

- `total_clusters` ≤ `total_articles_raw`, due to title+day deduplication (rewrites of the same story).

Representativeness caveats:

- GDELT is broad but not exhaustive.
- Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often unknown unless overridden.
- Representative URLs are the earliest per cluster, not necessarily the most authoritative.

## Overrides & human-in-the-loop

Maintain a small `bias-overrides.json` (≤ 100 entries) for frequent eTLD+1s:

```json
{
  "example.com": "center",
  "paper.co.uk": "right"
}
```

Workflow:

1. Review `unknown_summary.suggested_overrides` (sorted by support count).
2. Research each domain (editorial stance, ownership, fact-checking records).
3. Add it to `bias-overrides.json` with the chosen side (`left`, `center`, or `right`).
4. Keep the process auditable: PRs should state the evidence and the chosen side.

Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.

## LLM weak labeler (optional)

Default: OFF. When enabled:

- Uses Transformers.js zero-shot classification (tries a multilingual XNLI model, falls back to an English NLI model).
- Strict thresholds: confidence ≥ 0.8, margin ≥ 0.12.
- Caches results in `bias-cache.json` (`targeted_cache.llm`).
- Negative cache: skips domains for 14 days if results were low-confidence or non-English (with the English-only model).
- Not a replacement for overrides; use sparingly for the top unknown domains.

When to enable:

- Unknown% is high and manual overrides are impractical.
- Models are available (Transformers.js downloads them on first use).

## Operational notes

User-Agent:

- Update `USER_AGENT` in `main.js` with a real contact email: `"BlindspotDetectorBot/2.1 (contact: you@example.com)"`

Caching:

- `bias-cache.json` is refreshed at most every 7 days (unless `forceRefreshCache: true`).
- Includes `targeted_cache.llm` (LLM weak labels) and `llm_neg` (the 14-day skip list).
- The cache persists across runs.

Logs to expect:

- Fetch counts: `🔍 Fetching news from GDELT...`
- Clusters: `📦 Query "X": N articles → M clusters`
- Unknown%: the `confidence_note` field in the output
- Flags: the `blindspot_flags` array

Throughput tips:

- Limit the size of the `queries` array (fetches run in parallel, but GDELT rate limits apply).
- Widen `sinceHours` for sparse topics (more articles → better coverage).

## Limitations

- Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains go unresolved.
- Non-English and hyper-local outlets often remain unknown unless overridden.
- GDELT dedup isn't perfect: title+day clustering helps but won't catch every variation.
- Not a fact-checker: measures coverage, not truth or accuracy.
- English filter: `restrictToEnglish: true` excludes non-English sources (and may miss relevant coverage).

## Security & ethics

- Public sources only: GDELT DOC API, AllSides CSV, MBFC Gist (all public; the CSV and Gist are hosted on GitHub).
- No scraping: no paywalls, logins, or background scraping.
- Transparent heuristics: provenance is captured; overrides go through human review.
- Respectful rate limits: retries with backoff; the User-Agent identifies the bot.

## Roadmap

- Title-relevance gate (behind a flag) to filter off-topic articles.
- Fill `mode_side`/`consistency` in suggestions from tallies of known sources.
- Provenance source granularity (AllSides vs MBFC vs Overrides in the `method` field).
- Multilingual support when infrastructure allows (currently English-only filtering).

## Troubleshooting

High Unknown%:

- Widen `sinceHours` (more articles → more known sources).
- Add overrides for frequent unknown domains (see `unknown_summary.suggested_overrides`).

Empty representative URLs for a side:

- Story imbalance or sparse data; increase `sinceHours` or check query specificity.

LLM model load errors:

- Ensure models are available (Transformers.js downloads them on first use).
- Keep `enableLlmWeakLabeler: false` if models are unavailable.

Cache issues:

- Set `forceRefreshCache: true` to rebuild the bias map.
- Check that `bias-cache.json` exists and is valid JSON.

## Development

Repo structure:

```
blindspot-detector/
├── main.js               # Main actor logic
├── apify.json            # Minimal config for CLI compatibility
├── .actor/
│   └── actor.json        # Detailed actor configuration & input schema
├── package.json          # Dependencies
├── bias-cache.json       # Cached bias maps (auto-generated)
├── bias-overrides.json   # Manual overrides (optional)
└── input.json            # Example input
```

Configuration files:

- `apify.json` — minimal configuration for Apify CLI compatibility (legacy format support)
- `.actor/actor.json` — detailed actor configuration, including the input schema, metadata, and build settings

Scripts:

```bash
npm install     # Install dependencies
npx apify run   # Run locally
```

Code style:

- Node.js 20+, ES modules
- Async/await; p-limit for concurrency
- Conservative learning thresholds

## License

MIT
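As a closing illustration, the known-only shares and gap-aware flagging described under "Key design choices" can be sketched as follows. This is a re-implementation for illustration only, not the actor's code; the real flagging logic may apply additional conditions when choosing which sides to flag.

```javascript
// Known-only shares: Unknown clusters are excluded from the denominator.
function knownOnlyShares(counts) {
  // counts: cluster counts per side, e.g. { left: 13, center: 0, right: 2, unknown: 26 }
  const known = counts.left + counts.center + counts.right;
  const shares = {};
  for (const side of ['left', 'center', 'right']) {
    shares[side] = known ? +(100 * counts[side] / known).toFixed(1) : 0;
  }
  return { shares, known };
}

// Gap-aware flagging: a side is flagged when it is below the threshold AND
// the next-higher side is at least gapMinPct points above it.
function blindspotFlags(shares, thresholdPct = 20, gapMinPct = 12) {
  const ordered = Object.entries(shares).sort((a, b) => a[1] - b[1]);
  const flags = [];
  for (let i = 0; i < ordered.length - 1; i++) {
    const [side, pct] = ordered[i];
    const gap = +(ordered[i + 1][1] - pct).toFixed(1);
    if (pct < thresholdPct && gap >= gapMinPct) {
      flags.push({ side, pct, gap_vs_next_pct: gap });
    }
  }
  return flags;
}
```

With the counts behind the example output (`{ left: 13, center: 0, right: 2, unknown: 26 }`), `knownOnlyShares` yields 86.7 / 0.0 / 13.3 and `center` is flagged with a 13.3-point gap, matching the sample `blindspot_flags` entry.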


Ready to Get Started?

Try News Blindspot now on Apify. Free tier available with no credit card required.


Actor Information

Developer
spirosdouk
Pricing
Paid
Total Runs
28
Active Users
3
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

