Audio & Video to Text

by donjuan_mime

Transcribe audio and video to text and subtitles using OpenAI's Whisper. Outputs TXT, SRT, VTT, JSON, and TSV for developers and content creators.

503 runs
64 users
About Audio & Video to Text

Need to get text out of a video or audio file? This actor uses OpenAI's Whisper to handle the transcription for you, turning spoken content into editable, searchable text. It's perfect for developers and content creators who need to generate subtitles, create blog posts from podcasts, or analyze interview recordings without manual work. You can choose from the preloaded tiny, base, or small Whisper models to balance speed and accuracy for your project. It outputs to all the standard formats you'd expect: plain TXT files, subtitle files like SRT and VTT, or structured data in JSON and TSV.

I use it to quickly caption social media clips and prep transcripts for my own projects. It runs reliably on Apify, so you can plug it right into your automation workflows. If you're dealing with lectures, meetings, or any media file, this is a straightforward way to get a text version you can actually use.

What does this actor do?

Audio & Video to Text is a transcription tool available on the Apify platform. It runs OpenAI's Whisper in the cloud to turn the speech in audio and video files into text and subtitle files.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Audio & Video to Text Transcription

Overview

An Apify actor that transcribes audio and video files into text and subtitle formats using OpenAI's Whisper model. It processes direct URLs to media files (like MP4s) and outputs multiple standard formats.

Key Features

  • OpenAI Whisper Integration: Leverages Whisper for accurate speech-to-text transcription.
  • Multiple Output Formats: Returns transcriptions in JSON, plain text (TXT), SRT, VTT, and TSV formats in a single run.
  • Pre-installed Models: The tiny, base, and small Whisper models are included in the Docker image for faster, offline processing.
  • Direct URL Processing: Accepts a publicly accessible URL to an audio or video file.

Input

Configure the actor using a JSON input object with these parameters:

  • model (string): The Whisper model to use.
    • Options: tiny, base, small, medium, large, turbo.
    • The tiny, base, and small models are pre-installed for immediate use. Larger models (medium, large, turbo) will be downloaded on first run.
  • source_url (string): The direct URL to the video or audio file (e.g., https://example.com/video.mp4).
    • Note: YouTube links are not supported directly. You must provide a URL to a downloadable media file.

Example Input

{
  "model": "small",
  "source_url": "https://example.com/path/to/your/audio.mp3"
}
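The input rules above can be checked before a run is started. The following is a minimal sketch, not part of the actor itself: the helper name build_input and the list of YouTube hostnames are assumptions made for illustration, and the model names are taken from the options listed above.

```python
from urllib.parse import urlparse

# Model names listed in the Input section.
ALLOWED_MODELS = {"tiny", "base", "small", "medium", "large", "turbo"}
# Hostnames treated as YouTube pages (illustrative assumption, not exhaustive).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def build_input(model: str, source_url: str) -> dict:
    """Return an input object for the actor, or raise ValueError (hypothetical helper)."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    parsed = urlparse(source_url)
    # Require an absolute http(s) URL that points at a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("source_url must be an absolute http(s) URL")
    # YouTube page links are not direct media files.
    if (parsed.hostname or "").lower() in YOUTUBE_HOSTS:
        raise ValueError("YouTube links are not supported; use a direct media file URL")
    return {"model": model, "source_url": source_url}
```

Calling build_input("small", "https://example.com/path/to/your/audio.mp3") returns the same object as the example input, while an unknown model name or a YouTube link raises ValueError before any run is started.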

Output

The actor returns a JSON array containing a single object. This object holds the transcription in five different formats under the following keys:

  • json: The full, structured Whisper output including segments, tokens, and metadata, serialized as a JSON string.
  • txt: A plain text version of the transcribed speech.
  • srt: SubRip subtitle format.
  • vtt: WebVTT subtitle format.
  • tsv: Tab-separated values with columns for start time and end time (in milliseconds) and text.
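Since every format arrives under its own key, one way to use a result is to write each key to a file with the matching extension. This is a minimal sketch that assumes each value is a plain string; the helper name save_outputs and the default file stem are hypothetical.

```python
from pathlib import Path

# Output keys described above, mapped to file extensions.
EXTENSIONS = {"json": ".json", "txt": ".txt", "srt": ".srt", "vtt": ".vtt", "tsv": ".tsv"}


def save_outputs(item: dict, out_dir: str, stem: str = "transcript") -> list:
    """Write each format present in `item` to out_dir/<stem><ext> (hypothetical helper)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for key, ext in EXTENSIONS.items():
        if key in item:  # skip formats missing from the result
            path = target / f"{stem}{ext}"
            path.write_text(item[key], encoding="utf-8")
            written.append(path)
    return written
```

The function returns the paths it wrote, so a caller can pass the SRT or VTT file straight to a video editor or player.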

Example Output (Excerpt)

[
  {
    "json": "{ \"text\": \"What's your favorite drink?\", \"segments\": [...] }",
    "txt": "What's your favorite drink?\nMy favorite drink is apple juice.\n",
    "srt": "1\n00:00:00,000 --> 00:00:01,120\nWhat's your favorite drink?\n",
    "vtt": "WEBVTT\n\n00:00.000 --> 00:01.120\nWhat's your favorite drink?",
    "tsv": "start\tend\ttext\n0\t1120\tWhat's your favorite drink?"
  }
]
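In the excerpt above, the json field is itself a JSON-encoded string, and the tsv times (0 and 1120) line up with the SRT timestamps as milliseconds. Under those two assumptions, a result item could be decoded roughly as follows; both function names are hypothetical.

```python
import csv
import io
import json


def decode_whisper_json(item: dict) -> dict:
    """Decode the JSON-encoded string stored under the 'json' key."""
    return json.loads(item["json"])


def parse_tsv_segments(tsv_text: str) -> list:
    """Parse the 'tsv' field into dicts with start/end converted to seconds."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        {
            "start": int(row["start"]) / 1000,  # milliseconds -> seconds
            "end": int(row["end"]) / 1000,
            "text": row["text"],
        }
        for row in reader
    ]
```

Quoting is disabled in the reader so that quotation marks inside the transcribed text are kept as ordinary characters.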

How to Use

  1. In your Apify dashboard, create a new actor or task.
  2. Copy and paste the provided actor source code into the editor.
  3. Set the input configuration using the JSON structure shown above.
  4. Run the actor. It will fetch the media from the provided URL, transcribe it using the specified Whisper model, and return the results.
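A run can also be started through the Apify REST API instead of the dashboard. The sketch below uses only the standard library and the run-sync-get-dataset-items endpoint, which runs an actor and returns its dataset items in one call. The actor ID and token are placeholders you must supply yourself; this page does not state the actor's ID, so none is assumed here.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),  # the request body is the actor input
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the returned JSON array (performs a network call)."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without spending a run.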

Important Notes

  • This script is provided "as is" without warranty. Use it at your own risk.
  • You are responsible for ensuring compliance with relevant terms of service (e.g., YouTube's ToS if you transcribe downloaded content) and copyright laws.
  • For transcribing YouTube videos, you must first download the video as an MP4 (or similar) file and host it at a publicly accessible URL.

Actor Information

  • Developer: donjuan_mime
  • Pricing: Paid
  • Total Runs: 503
  • Active Users: 64

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
