This Open-Source Pipeline Transforms Any Podcast into AI-Ready Transcripts with Speaker Diarization (MIT License)

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    I stumbled upon this gem on GitHub and had to share it with the community. If you're working with podcasts, audio transcription, or building AI applications that need clean, structured audio data - this is going to save you weeks of work.


    What is be-flow-dtd?

    be-flow-dtd (Download → Transcribe → Diarize) is a production-ready podcast transcription pipeline that:
    • Automatically fetches new podcast episodes via the Taddy API
    • Transcribes audio with word-level timestamps using Whisper large-v3
    • Identifies who speaks when using Pyannote 3.1 speaker diarization
    • Matches voices against known speaker embeddings (ECAPA-TDNN)
    • Uploads structured JSON to Supabase cloud storage
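For anyone wondering what the voice-matching step looks like in principle, here is a minimal sketch that compares an embedding against known speakers using cosine similarity. This is not taken from the repo: `match_speaker`, the 0.7 threshold, and the toy 3-dimensional vectors are my own assumptions for illustration, and real speaker embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_speaker(embedding, known_speakers, threshold=0.7):
    """Return the best-matching known speaker, or None if nothing reaches the threshold.

    The threshold value is an assumption; it would need tuning on real data.
    """
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Toy reference embeddings (hypothetical values).
known = {"John Host": [0.9, 0.1, 0.0], "Guest": [0.0, 0.2, 0.9]}

host_match = match_speaker([0.8, 0.2, 0.1], known)   # close to "John Host"
no_match = match_speaker([0.0, 1.0, 0.0], known)     # close to nobody
```

Returning None for low-similarity voices means an unknown guest keeps the generic diarization label instead of being attributed to the wrong person.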


    Why This Matters for AI Developers

    The output is structured JSON with speaker attribution and word-level timestamps, which suits LLM training, RAG systems, and semantic search:






    {
      "transcript": [
        {
          "text": "Hello and welcome to the show.",
          "start": 0.0,
          "end": 2.5,
          "speaker_id": "host-uuid",
          "speaker_name": "John Host",
          "words": [
            {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}
          ]
        }
      ]
    }
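As a rough idea of how this output could be consumed downstream, the sketch below flattens the segments into records you might feed to an indexer. The input field names come from the sample above; the `to_chunks` helper and the shape of the records it returns are my own assumptions, not part of the project.

```python
import json

# The sample document from the post, as a JSON string.
sample = """
{
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}
      ]
    }
  ]
}
"""

def to_chunks(doc):
    """Flatten transcript segments into simple records (hypothetical shape)."""
    chunks = []
    for seg in doc["transcript"]:
        chunks.append({
            "text": seg["text"],
            # Fall back to the ID, then to a placeholder, if no name was matched.
            "speaker": seg.get("speaker_name") or seg.get("speaker_id") or "unknown",
            "start": seg["start"],
            "end": seg["end"],
        })
    return chunks

chunks = to_chunks(json.loads(sample))
```

Keeping `start` and `end` on every record lets a search hit link straight back to the right moment in the audio.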







    The Architecture is Clean





    Taddy API  → Download → Transcribe → Diarize    → Identify → Upload
    (episodes)   (yt-dlp)   (Whisper)    (Pyannote)   (ECAPA)    (Supabase)
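One detail the diagram hides is how word timestamps from the transcribe stage get combined with speaker turns from the diarize stage. I haven't checked how the repo does it, so treat this as a sketch of one common approach: give each word the speaker whose turn overlaps it the most. The `assign_speakers` helper and the tuple layout for turns are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Label each word with the speaker turn it overlaps most.

    words: dicts with "text", "start", "end" (as in the JSON output).
    turns: (start, end, label) tuples; this layout is an assumption.
    Words that overlap no turn get a speaker of None.
    """
    labeled = []
    for word in words:
        best_label, best_overlap = None, 0.0
        for start, end, label in turns:
            shared = overlap(word["start"], word["end"], start, end)
            if shared > best_overlap:
                best_label, best_overlap = label, shared
        labeled.append({**word, "speaker": best_label})
    return labeled

words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "Thanks", "start": 2.6, "end": 3.0},
]
turns = [(0.0, 2.5, "SPEAKER_00"), (2.5, 5.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
```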







    Each GPU model is loaded sequentially with explicit VRAM management, so only one model occupies GPU memory at a time. That goes a long way toward avoiding OOM errors.
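The general load-one-model-at-a-time pattern can be sketched as below. This is my own illustration, not the repo's code: `run_stage` and the stand-in loaders are hypothetical, and the PyTorch cache clearing only runs if `torch` happens to be installed.

```python
import gc

def free_gpu_memory():
    """Collect garbage and, if PyTorch is installed, release its cached CUDA memory."""
    gc.collect()
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def run_stage(load_model, run, payload):
    """Load one model, run it, and drop it before the next stage starts."""
    model = load_model()
    try:
        return run(model, payload)
    finally:
        # Drop the only reference so the memory can actually be reclaimed.
        del model
        free_gpu_memory()

# Stand-in loaders; real ones would load Whisper, Pyannote, and so on.
def load_transcriber():
    return lambda audio: f"transcript of {audio}"

def load_diarizer():
    return lambda text: f"{text} + speaker turns"

result = run_stage(load_transcriber, lambda m, x: m(x), "episode.mp3")
result = run_stage(load_diarizer, lambda m, x: m(x), result)
```

Keeping the model reference local to `run_stage` matters here: clearing the cache does nothing useful while some other variable still holds the model.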


    Key Features I Love

    • Automatic Episode Discovery - Set it and forget it
    • State Tracking - SQLite prevents reprocessing
    • GPU Optimized - Works on 8GB VRAM cards
    • Docker Ready - Deploy anywhere with docker-compose
    • MIT Licensed - Use it however you want
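On the state tracking point, here is roughly how SQLite can be used to skip episodes that were already handled. The table name, column names, and helper functions are assumptions of mine; the repo's actual schema may look quite different.

```python
import sqlite3

def open_state(path=":memory:"):
    """Open (or create) the state database. Table layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "episode_id TEXT PRIMARY KEY, "
        "processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def is_processed(conn, episode_id):
    """True if this episode ID has already been recorded."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE episode_id = ?", (episode_id,)
    ).fetchone()
    return row is not None

def mark_processed(conn, episode_id):
    """Record an episode; INSERT OR IGNORE makes repeated calls harmless."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO processed (episode_id) VALUES (?)",
            (episode_id,),
        )

conn = open_state()  # in-memory for the demo; use a file path in practice
first_run = [ep for ep in ["ep-1", "ep-2"] if not is_processed(conn, ep)]
for ep in first_run:
    mark_processed(conn, ep)

# A later run sees one new episode and skips the two already done.
second_run = [ep for ep in ["ep-1", "ep-2", "ep-3"] if not is_processed(conn, ep)]
```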


    Quick Start





    git clone https://github.com/goonerstrike/be-flow-dtd
    cd be-flow-dtd
    pip install -r requirements.txt

    # Configure your API keys in .env
    python main.py --dry-run --verbose







    You'll need:
    • Taddy API key (podcast metadata)
    • HuggingFace token (for Pyannote models)
    • Supabase credentials (cloud storage)
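A quick pre-flight check for those credentials might look like the sketch below. The variable names are guesses on my part (check the repo's own .env example for the real ones), and `missing_vars` is a hypothetical helper. Note that a .env file is not read into the environment automatically; something like python-dotenv has to load it first.

```python
import os

# Assumed names; the project's actual .env keys may differ.
REQUIRED_VARS = ["TADDY_API_KEY", "HF_TOKEN", "SUPABASE_URL", "SUPABASE_KEY"]

def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Demo with a plain dict instead of the real environment.
example_env = {"TADDY_API_KEY": "abc", "HF_TOKEN": ""}
missing = missing_vars(example_env)
```

Failing fast on a missing key is cheaper than finding out after a long download and transcription run.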


    Built with CocoIndex

    The project uses CocoIndex for pipeline visualization and monitoring. Run cocoindex server -ci cocoindex_flow.py and connect to CocoInsight for real-time visibility into your data flows.


    Performance

    On an RTX 3090/4090, you can process ~100 hours of audio per day with the large-v3 model. For smaller GPUs (8GB), the medium model works great.
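To put that throughput figure in perspective, here is the arithmetic, assuming the machine runs around the clock. Only the 100 hours/day number comes from the post; the rest is just division.

```python
audio_hours_per_day = 100   # figure quoted above
wall_clock_hours = 24       # assumes the GPU is busy all day

# How many times faster than real time the whole pipeline runs.
speedup = audio_hours_per_day / wall_clock_hours

# Wall-clock minutes needed per hour of audio.
minutes_per_audio_hour = 60 / speedup

print(f"{speedup:.2f}x real time")                        # 4.17x real time
print(f"{minutes_per_audio_hour:.1f} min per audio hour")  # 14.4 min per audio hour
```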





    GitHub: https://github.com/goonerstrike/be-flow-dtd


    If you're building anything with podcast data, audio transcription, or need speaker-attributed transcripts for your AI projects - definitely check this out. The structured JSON output is chef's kiss for downstream ML pipelines.


    Has anyone else been working on similar audio processing pipelines? Would love to hear about your approaches!



