GTM Leads Cleaner
by yummy_gelato
Upload any lead CSV and get a CRM-ready dataset: email validation, name/company cleanup, job-title bucketing, and dedupe by email or domain+name.
What does this actor do?
GTM Leads Cleaner is a lead-data cleaning tool available on the Apify platform. It takes a lead CSV and returns a CRM-ready dataset with validated emails, cleaned names and companies, bucketed job titles, and duplicates flagged or removed, all running in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# GTM Leads Cleaner - CSV Lead Deduplication & Email Validation

## What is GTM Leads Cleaner?

GTM Leads Cleaner is an Apify Actor that cleans, normalizes, and deduplicates GTM (Go-To-Market) lead data from CSV files. Built for sales teams, RevOps professionals, and marketers who need to prepare leads for CRM import with validated emails, standardized names, and categorized job titles.

## See It In Action

🎬 Video demo coming soon!

## Why Use GTM Leads Cleaner?

- ✅ Save hours of manual work - Process 10,000 leads in 2-3 minutes
- ✅ Improve CRM data quality - Validated emails, standardized names, clean formatting
- ✅ Better lead routing - GTM-focused job title categorization for accurate scoring
- ✅ Smart deduplication - Match by email or domain+name combination
- ✅ Pay only for what you use - Just $0.001 per lead processed

## Use Cases

### Clean CRM Exports Before Re-Import

Export your HubSpot, Salesforce, or Pipedrive contacts, run them through the cleaner, and re-import with normalized data and duplicates removed.

### Deduplicate Leads from Multiple Sources

Combine leads from trade shows, webinars, content downloads, and scraped data into a single clean list without duplicates.

### Prepare Sales Intelligence Exports

Clean exports from Apollo.io, ZoomInfo, or LinkedIn Sales Navigator before loading into your CRM or sales engagement platform.

### Standardize Job Titles for Lead Scoring

Categorize job titles into consistent GTM buckets (Founder/C-level, Sales leadership, Marketing IC, etc.) for accurate lead scoring and routing.
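To make the job-title bucketing idea concrete, here is a minimal sketch of keyword-based categorization. It is not the Actor's actual implementation: the keyword lists are a subset of the example keywords in the "Job Title Buckets" table of this README, and the function name `bucket_title`, the whole-word matching, and the "longest matching keyword wins" rule are assumptions made for illustration.

```python
import re

# Example keywords copied from this README's bucket table (subset only).
# The Actor's real keyword lists and matching rules may differ.
BUCKET_KEYWORDS = {
    "Founder / C-level": ["founder", "ceo", "cto", "cfo", "chief", "president", "owner"],
    "Sales leadership": ["head of sales", "vp sales", "sales director", "sales manager"],
    "Sales IC": ["account executive", "sdr", "bdr", "business development"],
    "Product": ["product manager", "product owner", "product lead"],
    "Engineering / Technical": ["engineer", "developer", "architect", "devops"],
}


def bucket_title(title: str) -> str:
    """Hypothetical helper: pick the bucket with the longest whole-word keyword match."""
    normalized = " ".join(title.lower().split())  # lowercase, collapse whitespace
    best_bucket, best_len = "Other", 0
    for bucket, keywords in BUCKET_KEYWORDS.items():
        for keyword in keywords:
            # \b word boundaries keep "cto" from matching inside "director".
            pattern = rf"\b{re.escape(keyword)}\b"
            if len(keyword) > best_len and re.search(pattern, normalized):
                best_bucket, best_len = bucket, len(keyword)
    return best_bucket


if __name__ == "__main__":
    for raw in ["Sales Director", "Product Owner", "Senior Software  Engineer", "Office Assistant"]:
        print(f"{raw!r} -> {bucket_title(raw)}")
```

Preferring the longest keyword is one way to stop a generic term such as "owner" from overriding a more specific phrase such as "product owner". Unmatched titles fall back to "Other", mirroring the fallback bucket described in this README.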
## Features

- 📧 Email Validation & Normalization - Trims whitespace, lowercases, validates format, and extracts first email from multi-email fields
- 👤 Name Processing - Splits full names into first/last, normalizes whitespace
- 🏢 Company Normalization - Cleans company names, removes extra whitespace
- 🌐 Domain Extraction - Derives domain from email or website column
- 🎯 Job Title Bucketing - Categorizes job titles into 10 GTM-focused buckets
- 🔄 Lead Deduplication - Finds duplicates by email or domain+name combination
- 🔍 Auto Column Detection - Automatically detects column mappings from various header formats
- 💰 Pay-Per-Event Pricing - Only pay for leads you actually process

## How Much Does It Cost to Clean Leads?

The GTM Leads Cleaner uses Apify's pay-per-event pricing model:

| Volume | Cost per Lead | Example |
|--------|---------------|---------|
| Any volume | $0.001 | 1,000 leads = $1.00 |

### Cost Comparison

| Method | Cost for 10,000 Leads | Time |
|--------|----------------------|------|
| GTM Leads Cleaner | ~$10 | 2-3 minutes |
| Manual cleaning | $200-500 (VA time) | 8-20 hours |
| Custom script | $0 + dev time | Hours to build |

Typical run times:

- 1,000 rows: ~30 seconds
- 10,000 rows: ~2-3 minutes
- 100,000 rows: ~15-20 minutes

## Tutorial: How to Clean Your Lead CSV

### Step 1: Prepare Your CSV

Ensure your CSV file:

- Is UTF-8 encoded
- Has a header row
- Contains at minimum an email column

### Step 2: Upload Your File

You have three options:

1. File Upload - Use the file upload button in the Apify Console
2. URL - Provide a direct URL to your CSV file
3. Key-Value Store - Reference a file already in your Apify Key-Value Store

### Step 3: Configure Options

```json
{
  "inputFile": "leads.csv",
  "dedupeStrategy": "email",
  "outputFormat": "dataset",
  "includeDuplicates": false
}
```

Key options:

- `dedupeStrategy`: Choose "email" for email-based matching or "domain+name" for matching on domain plus normalized full name
- `outputFormat`: "dataset" for API access or "csv" for downloadable file
- `includeDuplicates`: Set to true if you want to see duplicate rows (marked with `is_duplicate=true`)

### Step 4: Run and Download Results

1. Click "Start" to run the Actor
2. Wait for completion (check the "Runs" tab for progress)
3. Download results from the "Storage" tab:
   - Dataset: Clean leads in JSON format
   - Key-Value Store: `cleaned_leads.csv` (if CSV output enabled) and `SUMMARY` stats

## Input Schema

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| inputFile | string | required | CSV file (upload, URL, or KV store key) |
| dedupeStrategy | enum | "email" | "email" or "domain+name" |
| outputFormat | enum | "dataset" | "dataset" or "csv" |
| includeDuplicates | boolean | false | Keep duplicate rows in output |
| autoDetectPreference | enum | "first" | Tie-breaking: "first", "last", or "fail" |
| emailColumn | string | auto | Manual email column override |
| nameColumn | string | auto | Manual name column override |
| companyColumn | string | auto | Manual company column override |
| jobTitleColumn | string | auto | Manual job title column override |
| fieldMap | object | {} | Programmatic column mapping (highest priority) |

### Deduplication Strategies

- `email` - Matches on normalized email address. First occurrence is primary, subsequent matches are marked as duplicates.
- `domain+name` - Matches on normalized full name + domain combination. Useful when the same person appears with different email addresses.
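The two strategies above can be sketched in a few lines of Python. This is an illustrative approximation, not the Actor's source code: the helpers `dedupe_key` and `mark_duplicates` are hypothetical, the 1-based `original_row_index` is an assumption, and rows with no usable key are treated as unique here, which the Actor may handle differently. Output field names follow the dataset fields documented in this README.

```python
def dedupe_key(row: dict, strategy: str):
    """Hypothetical helper: build the comparison key for one lead row."""
    email = row.get("email", "").strip().lower()  # trim + lowercase
    if strategy == "email":
        return email or None
    # "domain+name": domain taken from the email, name lowercased with
    # whitespace collapsed. Both parts must be present to form a key.
    domain = email.rpartition("@")[2] if "@" in email else ""
    name = " ".join(row.get("full_name", "").lower().split())
    return (domain, name) if domain and name else None


def mark_duplicates(rows: list, strategy: str = "email") -> list:
    """Mark every row after the first occurrence of a key as a duplicate."""
    first_seen = {}  # key -> index of the primary (first) row
    marked = []
    for index, row in enumerate(rows, start=1):  # assumption: 1-based indices
        key = dedupe_key(row, strategy)
        # Rows without a key are never matched against anything.
        primary = index if key is None else first_seen.setdefault(key, index)
        marked.append({
            **row,
            "original_row_index": index,
            "is_duplicate": primary != index,
            "duplicate_of_index": primary if primary != index else None,
        })
    return marked


if __name__ == "__main__":
    leads = [
        {"email": " JANE@ACME.COM ", "full_name": "Jane Doe"},
        {"email": "jane@acme.com", "full_name": "Jane Doe"},
        {"email": "j.doe@acme.com", "full_name": "jane  doe"},
    ]
    for strategy in ("email", "domain+name"):
        flags = [r["is_duplicate"] for r in mark_duplicates(leads, strategy)]
        print(strategy, flags)
```

In this sketch the third row is a duplicate only under "domain+name", because its email address differs while its domain and normalized name match the first row. Dropping rows where `is_duplicate` is true would correspond to `includeDuplicates: false`.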
### Auto-Detection Preferences

When multiple columns match a pattern (e.g., both "Email" and "Work Email"):

- `first` - Uses the first matching column (leftmost in CSV)
- `last` - Uses the last matching column (rightmost in CSV)
- `fail` - Aborts with error listing candidates

## Output Format

### Dataset Output (default)

Each row is pushed to the Apify default dataset with canonical fields:

```json
{
  "original_row_index": 1,
  "email": "JANE@ACME.COM",
  "normalized_email": "jane@acme.com",
  "email_is_valid": true,
  "full_name": "Jane Doe",
  "first_name": "Jane",
  "last_name": "Doe",
  "company": "Acme Inc",
  "domain": "acme.com",
  "role_raw": "Head of Growth",
  "role_bucket": "Marketing leadership",
  "is_duplicate": false,
  "duplicate_of_index": null,
  "dedupe_strategy_used": null,
  "source_file": "leads.csv",
  "error_message": null
}
```

### CSV Export

When `outputFormat: "csv"`, a `cleaned_leads.csv` file is written to the Key-Value Store with:

1. Canonical GTM fields (fixed order)
2. Original columns (preserved order)

### Summary Statistics

A `SUMMARY` JSON is always written to the Key-Value Store:

```json
{
  "total_rows": 1000,
  "processed_rows": 1000,
  "duplicate_rows": 50,
  "unique_leads": 950,
  "invalid_email_rows": 25,
  "input_file_name": "leads.csv",
  "dedupe_strategy": ["email"],
  "warnings": [],
  "created_at": "2024-01-15T10:30:00Z"
}
```

## Job Title Buckets for Lead Categorization

Job titles are automatically categorized into GTM-focused buckets (9 defined + "Other" fallback):

| Bucket | Example Keywords |
|--------|-----------------|
| Founder / C-level | founder, ceo, cto, cfo, chief, president, owner |
| RevOps / SalesOps | revops, revenue operations, sales operations, crm manager |
| Marketing leadership | head of marketing, vp marketing, marketing director, growth lead |
| Sales leadership | head of sales, vp sales, sales director, sales manager |
| Marketing IC | marketing specialist, demand gen specialist, content marketer |
| Sales IC | account executive, sdr, bdr, business development |
| Product | product manager, product owner, product lead |
| Engineering / Technical | engineer, developer, architect, devops |
| Customer Success | customer success, csm, account manager, onboarding |
| Other | (default fallback for unmatched titles) |

## CSV Column Auto-Detection

The Actor recognizes common header variations:

| Field | Recognized Headers |
|-------|-------------------|
| Email | email, e-mail, work email, contact email |
| Full Name | name, full name, contact, person |
| First Name | first name, given name, first |
| Last Name | last name, surname, family name, last |
| Company | company, organization, org, employer |
| Job Title | title, job title, position, role |
| Domain | domain, website, url, company domain |

Headers are matched case-insensitively.

## Error Handling

### Fatal Errors (Actor fails)

- Invalid file format (not .csv)
- UTF-8 decode failure
- Missing required email column
- Empty input file
- Tie-breaking with "fail" preference when multiple candidates exist

### Row-Level Errors

Rows with processing errors continue through the pipeline with:

- `error_message` field set
- `email_is_valid` set to false
- Other fields populated where possible

### Warnings

Non-fatal issues are logged and included in the summary:

- High duplicate rate (>30%)
- High invalid email rate (>20%)
- Column detection ambiguities

## Integrations & API Access

### Zapier Integration

1. Use the "Apify" app in Zapier
2. Select "Run Actor" action
3. Choose "gtm-leads-cleaner" Actor
4. Map your CSV file URL to the `inputFile` parameter
5. Use "Get Dataset Items" to retrieve cleaned leads

### Make.com (Integromat)

1. Add the Apify module to your scenario
2. Use "Run an Actor" action
3. Configure input with your CSV file
4. Use "Get Dataset Items" to retrieve results
5. Route cleaned leads to your CRM module

### n8n Workflow

1. Use the Apify node
2. Set operation to "Run Actor"
3. Configure the Actor ID and input parameters
4. Use HTTP Request node to fetch dataset results
5. Connect to your CRM node (HubSpot, Salesforce, etc.)

### Python SDK

```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")
actor = client.actor("your-username/gtm-leads-cleaner")

run = actor.call(run_input={
    "inputFile": "https://example.com/leads.csv",
    "dedupeStrategy": "email",
    "outputFormat": "dataset"
})

# Get results
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(item["normalized_email"], item["is_duplicate"])
```

### JavaScript / Node.js SDK

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your-api-token' });

const run = await client.actor('your-username/gtm-leads-cleaner').call({
    inputFile: 'https://example.com/leads.csv',
    dedupeStrategy: 'email',
    outputFormat: 'dataset'
});

// Get results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(item.normalized_email, item.is_duplicate);
});
```

### Direct API Call

```bash
curl -X POST "https://api.apify.com/v2/acts/your-username~gtm-leads-cleaner/runs?token=your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "inputFile": "https://example.com/leads.csv",
    "dedupeStrategy": "email"
  }'
```

## FAQ

### What CSV formats are supported?

The Actor supports standard UTF-8 encoded CSV files with a header row. Files must have the .csv extension. The Actor handles various delimiters and quote characters automatically.

### Can I use custom column mappings?

Yes! You have three options:

1. Individual overrides: Use `emailColumn`, `nameColumn`, `companyColumn`, or `jobTitleColumn` to specify exact header names
2. Field map: Use the `fieldMap` parameter for programmatic mapping of all fields at once
3. Auto-detection: Let the Actor detect columns automatically (works with most common header formats)

### How does deduplication work?

The Actor supports two deduplication strategies:

- Email-based: Compares normalized email addresses (lowercased, trimmed). First occurrence is kept as the primary record.
- Domain+Name: Compares the combination of domain (from email or website) and normalized full name. Useful when the same person has multiple email addresses.

Duplicates are either filtered out (default) or marked with `is_duplicate=true` and `duplicate_of_index` pointing to the primary record (when `includeDuplicates=true`).

### What happens to invalid emails?

Rows with invalid emails are still processed and included in the output. They are marked with:

- `email_is_valid`: false
- `normalized_email`: The original email (lowercased and trimmed)
- All other fields are processed normally

You can filter these out in your downstream system or use the `email_is_valid` field for conditional logic.

### Does it support pay-per-event pricing?

Yes! The Actor uses Apify's pay-per-event model. You're charged $0.001 per processed lead, meaning you only pay for what you use. The pricing appears as "Charged for X events" in your Apify billing.

### Can I keep duplicate rows in the output?

Yes, set `includeDuplicates: true` in your input. Duplicates will be included but marked with `is_duplicate: true` and `duplicate_of_index` showing which record they duplicate.

### What's the maximum file size?

There's no hard limit, but for optimal performance:

- Files under 100MB process quickly
- Larger files may require more memory (adjust in Actor settings)
- For very large files (1M+ rows), consider splitting into chunks

## Development

### Local Development

```bash
# Install dependencies
uv sync

# Run tests
uv run pytest tests/ -v

# Run locally
apify run
```

### Test Commands

```bash
# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=src --cov-report=html

# Run specific test file
uv run pytest tests/test_integration.py -v
```

## Related Apify Actors

Looking for more data processing and lead generation tools? Check out these related Actors:

- 🔎 Google Maps Scraper - Extract business data from Google Maps
- 💼 LinkedIn Profile Scraper - Scrape LinkedIn profiles for lead enrichment
- 📊 CSV to JSON Converter - Convert CSV files to JSON format

## Resources

- Apify Platform Documentation
- Apify SDK for Python
- Actor Development Guide

## License

Apache 2.0
Actor Information
- Developer: yummy_gelato
- Pricing: Paid
- Total Runs: 1
- Active Users: 1