# Building a YouTube Transcript Pipeline

A Python CLI that generates structured transcripts from any YouTube video.

The tool follows a simple strategy: if a video already contains captions, it downloads and parses them. If captions are unavailable, it automatically falls back to local speech-to-text by downloading the video, extracting its audio, and transcribing it with **faster-whisper**. Regardless of the execution path, both outputs are normalized into the same timestamped transcript format.

The primary design goals are:

*   **Offline-first** transcription
    
*   **No paid APIs**
    
*   **Minimal external dependencies**
    
*   **Consistent output format** for downstream NLP and retrieval tasks
    

### Pipeline Overview

The pipeline begins by accepting a YouTube URL and retrieving the video's metadata. If subtitles are available, they are downloaded as an SRT file and parsed directly into transcript segments.
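
As a rough illustration of the caption path, the sketch below parses SRT text into the segment dictionaries used throughout this article. The name `parse_srt` and the regular expression are assumptions made for illustration, not the project's actual parser, and only the common `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue layout is handled.

```python
import re

# Matches an SRT timing line such as "00:00:04,330 --> 00:00:09,010".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(srt_text):
    """Convert SRT text into a list of {"start", "end", "text"} dicts."""
    segments = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match is None:
                continue  # skip the numeric cue index above the timing line
            # Everything after the timing line is cue text; join wrapped lines.
            text = " ".join(part.strip() for part in lines[index + 1:] if part.strip())
            if text:
                segments.append({
                    "start": _to_seconds(*match.groups()[:4]),
                    "end": _to_seconds(*match.groups()[4:]),
                    "text": text,
                })
            break
    return segments
```

Cues with no text are dropped, so the result contains only segments that carry spoken content.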

When subtitles are unavailable, the application downloads the video, extracts its audio using FFmpeg, and transcribes it locally using **faster-whisper**.
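
The fallback decision can be sketched as a small orchestration function. Every name here (`build_transcript` and the four injected callables) is hypothetical; the callables stand in for the project's yt-dlp, FFmpeg, and faster-whisper steps so the control flow can be read in isolation.

```python
def build_transcript(url, fetch_captions, download_video, extract_audio, transcribe_audio):
    """Return (segments, source), trying captions first and speech-to-text second.

    All four callables are placeholders for the real pipeline steps.
    fetch_captions(url) is assumed to return a list of segments, or None
    when the video has no captions.
    """
    segments = fetch_captions(url)
    if segments:
        return segments, "captions"

    # Fallback path: video -> audio -> local speech-to-text.
    video_path = download_video(url)
    audio_path = extract_audio(video_path)
    return transcribe_audio(audio_path), "whisper"
```

Passing the steps in as arguments is only a convenience for this sketch; it lets the branching be exercised without network access or a model download.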

Both execution paths converge into a common normalization stage where every transcript is represented as an array of timestamped segments. Finally, the processed transcript is exported in two formats:

*   `transcript.txt` — human-readable transcript
    
*   `transcript.json` — structured JSON for programmatic use
    

### Local Speech-to-Text with faster-whisper

When subtitles are unavailable, transcription is performed locally using **faster-whisper**.

Unlike OpenAI's reference Whisper implementation, faster-whisper runs inference through **CTranslate2**, an optimized C++ inference engine. With `compute_type="int8"`, as used here, model weights are quantized to **INT8**, which reduces memory usage and speeds up CPU inference, typically with little accuracy loss on clean audio.

```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
)
```

Transcription is straightforward:

```python
def transcribe_audio(audio_path):
    whisper_segments, info = model.transcribe(
        audio_path,
        vad_filter=True
    )

    print(
        "Detected language '%s' with probability %f"
        % (info.language, info.language_probability)
    )

    parsed_segments = []

    for segment in whisper_segments:
        parsed_segments.append({
            "start": segment.start,
            "end": segment.end,
            "text": segment.text.strip()
        })

    return parsed_segments
```

`transcribe` returns a lazy generator of timestamped segments plus metadata such as the detected language and its probability; the actual transcription runs as the loop consumes the generator. Each segment is converted into a normalized schema so that subtitle-based and speech-to-text transcripts share an identical structure.

### Audio Extraction with FFmpeg

Whisper models operate on **mono, 16 kHz** audio, so the pipeline converts the audio track to a WAV file in that format up front. FFmpeg still reads the container, but the video stream is never decoded, which avoids unnecessary processing overhead.

```python
import subprocess
from pathlib import Path


def extract_audio(video_path, output_dir):
    output_path = Path(output_dir) / "audio.wav"

    command = [
        "ffmpeg",
        "-y",
        "-i", video_path,
        "-vn",
        "-ac", "1",
        "-ar", "16000",
        str(output_path),
    ]

    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            check=True,
        )

        print("Audio extracted successfully!")
        return output_path

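    except FileNotFoundError:
        raise RuntimeError(
            "FFmpeg is not installed or not available in PATH."
        ) from None

    except subprocess.CalledProcessError as error:
        # Surface FFmpeg's own error output instead of silently returning None.
        details = error.stderr.decode(errors="replace").strip()
        raise RuntimeError(f"Error extracting audio: {details}") from error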
```

The important FFmpeg flags are:

| Flag | Purpose |
| --- | --- |
| `-vn` | Discards the video stream entirely |
| `-ac 1` | Downmixes audio to mono |
| `-ar 16000` | Resamples audio to 16 kHz |

Because only the audio stream is processed, extraction is significantly faster than decoding the entire video. The generated WAV file can then be passed directly to faster-whisper for transcription.

### Unified Transcript Schema

Whether the transcript originates from YouTube captions or local speech recognition, both are normalized into the same structure before being written to disk.

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 4.32,
      "text": "Hello everyone."
    },
    {
      "start": 4.33,
      "end": 9.01,
      "text": "Today we're looking at..."
    }
  ]
}
```

A plain-text version is also generated:

```plaintext
[0.00 → 4.32]
Hello everyone.

[4.33 → 9.01]
Today we're looking at...
```
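
One way to produce both files from a list of segments is sketched below. The function name `write_transcript` and the two-decimal timestamp formatting are assumptions based on the sample output above, not code taken from the project.

```python
import json
from pathlib import Path


def write_transcript(segments, output_dir):
    """Write transcript.json and transcript.txt into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    json_path = output_dir / "transcript.json"
    # ensure_ascii=False keeps non-English text readable in the file.
    json_path.write_text(
        json.dumps({"segments": segments}, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )

    # One "[start → end]" header per segment, followed by its text.
    blocks = [
        f"[{seg['start']:.2f} → {seg['end']:.2f}]\n{seg['text']}"
        for seg in segments
    ]
    txt_path = output_dir / "transcript.txt"
    txt_path.write_text("\n\n".join(blocks) + "\n", encoding="utf-8")

    return json_path, txt_path
```

Writing both files from the same in-memory list keeps the text and JSON outputs in sync by construction.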

Representing timestamps as floating-point seconds makes the output immediately usable for downstream tasks such as:

*   RAG document chunking
    
*   Semantic search
    
*   Subtitle generation
    
*   Speaker diarization
    
*   Video indexing
    
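
To illustrate the first item, the sketch below groups consecutive segments into larger chunks of bounded duration while preserving each chunk's time range. The function name `chunk_segments` and the 30-second default are arbitrary choices made for illustration.

```python
def chunk_segments(segments, max_seconds=30.0):
    """Merge consecutive segments into chunks of roughly max_seconds or less.

    A single segment longer than max_seconds becomes its own chunk.
    """
    chunks = []
    current = None

    for seg in segments:
        # Start a new chunk when adding this segment would exceed the limit.
        if current is None or seg["end"] - current["start"] > max_seconds:
            if current is not None:
                chunks.append(current)
            current = {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        else:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]

    if current is not None:
        chunks.append(current)
    return chunks
```

Because each chunk keeps its `start` and `end`, a retrieval hit can be linked back to the matching position in the video.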

### Tech Stack

**yt-dlp**

Downloads YouTube videos and extracts subtitle tracks (`.srt`/`.vtt`) when available. It also handles format selection, authentication cookies, and rate limiting.
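
As a hedged sketch of how caption availability might be checked: the metadata dictionary returned by yt-dlp's `YoutubeDL.extract_info` lists manually uploaded tracks under `subtitles` and auto-generated ones under `automatic_captions`. The helper names `find_caption_language` and `fetch_info` are illustrative, and the preference order is an assumption rather than the project's actual logic.

```python
def find_caption_language(info, preferred=("en",)):
    """Return (language, kind) for the first preferred caption track, or None."""
    # Prefer human-written subtitles over auto-generated captions.
    for kind in ("subtitles", "automatic_captions"):
        tracks = info.get(kind) or {}
        for language in preferred:
            if tracks.get(language):
                return language, kind
    return None


def fetch_info(url):
    """Fetch video metadata without downloading any media."""
    import yt_dlp  # imported lazily so the helper above has no dependencies

    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        return ydl.extract_info(url, download=False)
```

A `None` result from `find_caption_language(fetch_info(url))` would be the signal to take the speech-to-text path.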

**faster-whisper**

A CTranslate2-powered implementation of OpenAI's Whisper model that performs fully local speech-to-text inference. It supports quantized models, language detection, and timestamped transcription without requiring an API key.

**FFmpeg**

A system-level multimedia toolkit used to extract and convert audio. In this project it is invoked via Python's `subprocess` module rather than through a Python wrapper.

**Python 3.8+**

Provides the orchestration layer, using:

*   `argparse` for the CLI
    
*   `pathlib` for cross-platform file handling
    
*   `json` for transcript serialization
    
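
A minimal sketch of the CLI layer might look like the following. Only the `--url` flag appears in the usage shown in this article; the `--output-dir` option and the `build_parser` name are hypothetical additions for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate a timestamped transcript from a YouTube video."
    )
    parser.add_argument("--url", required=True, help="YouTube video URL")
    # Hypothetical option: where to write transcript.txt and transcript.json.
    parser.add_argument("--output-dir", type=Path, default=Path("output"))
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--url", "https://youtu.be/jNQXAC9IVRw"])
print(f"Transcribing {args.url} into {args.output_dir}")
```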

### Running the Project

1.  Create a virtual environment

    ```bash
    python3 -m venv venv
    ```

2.  Activate it

    *Linux / macOS*

    ```bash
    source venv/bin/activate
    ```

    *Windows (Command Prompt)*

    ```bat
    venv\Scripts\activate
    ```

    *Windows (PowerShell)*

    ```powershell
    venv\Scripts\Activate.ps1
    ```

3.  Install dependencies

    ```bash
    pip install -r requirements.txt
    ```

4.  Run the application

    ```bash
    python3 main.py --url <youtube_video_url>
    ```

    Example:

    ```bash
    python3 main.py --url "https://youtu.be/jNQXAC9IVRw?si=-fcBEgeyhDUJw47M"
    ```