Building an AI Conversation Practice App: Part 2 - Backend Speech-to-Text Processing with OpenAI Whisper

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5168

    #1

    This is the second post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.


    Overview: The STT Pipeline

    The complete STT workflow involves:

    1. Audio Reception → FormData parsing with formidable
    2. File Validation → WebM format verification and size checks
    3. Stream Processing → Direct file stream to OpenAI API
    4. Transcription → Whisper-1 model with Canadian English optimization
    5. Response Handling → Error management and cleanup
    6. Integration → Seamless handoff to conversation system


    Total processing time: 200-500ms


    Technical Stack Summary:
    • Primary STT: OpenAI Whisper-1
    • File Processing: Formidable + Node.js streams
    • Language: TypeScript with Next.js API routes
    • Error Handling: Basic try-catch with error logging
    • Performance: Stream processing, Node.js runtime


    The Challenges I Solved

    1. File Upload Complexity in Next.js

    Problem: Next.js API routes have strict limitations on file uploads, especially with form-data.


    Solution: Used a custom formidable-based parser:






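    import { IncomingForm, Fields, Files } from 'formidable';

    // Disable Next.js body parsing so formidable can read the raw request stream
    export const config = { api: { bodyParser: false } };

    // Custom form parsing with formidable
    const form = new IncomingForm({
      keepExtensions: true,          // keep the ".webm" extension on the temp file
      maxFileSize: 25 * 1024 * 1024, // reject uploads larger than 25MB
    });

    // Wrap the callback-based parse() in a Promise (runs inside the API route handler)
    const formData: [Fields, Files] = await new Promise((resolve, reject) => {
      form.parse(req, (err, fields, files) => {
        if (err) return reject(err);
        resolve([fields, files]);
      });
    });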







    Why this works:
    • Bypasses the default 1MB body size limit of the Next.js body parser
    • Handles WebM files up to 25MB (the Whisper API upload limit)
    • Maintains file metadata and extensions
    • Provides proper error handling
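
    The promise wrapper above can be factored into a small reusable helper. This is a minimal sketch rather than the code used in the app: CallbackFormParser and parseForm are hypothetical names, and the interface only describes the callback-style parse() method that formidable's IncomingForm exposes.

```typescript
// Minimal structural type for anything with a formidable-style parse() method.
// (Hypothetical name; formidable's IncomingForm should satisfy this shape.)
interface CallbackFormParser<Req, Fields, Files> {
  parse(
    req: Req,
    callback: (err: unknown, fields: Fields, files: Files) => void
  ): void;
}

// Wrap the callback-based parse() in a Promise so it can be awaited.
function parseForm<Req, Fields, Files>(
  form: CallbackFormParser<Req, Fields, Files>,
  req: Req
): Promise<{ fields: Fields; files: Files }> {
  return new Promise((resolve, reject) => {
    form.parse(req, (err, fields, files) => {
      if (err) {
        // Normalize non-Error values so callers always catch an Error
        reject(err instanceof Error ? err : new Error(String(err)));
        return;
      }
      resolve({ fields, files });
    });
  });
}
```

    Inside the handler this would be called as: const { fields, files } = await parseForm(form, req); which keeps the route body focused on validation and transcription.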


    2. Stream Processing for Large Files

    Problem: Loading entire audio files into memory causes performance issues and potential crashes after deployment.


    Solution: Direct stream processing to OpenAI API:






    import { createReadStream } from 'fs';

    // Create readable stream from uploaded file
    const audioPath = audioFile.filepath;
    const audioStream = createReadStream(audioPath);

    // Stream directly to OpenAI (no memory buffering)
    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });







    Performance Benefits:
    • Significantly reduced memory usage through streaming
    • Faster processing for large files
    • Better reliability and no memory overflow crashes


    3. Frontend Audio Validation

    Problem: Short audio recordings (under 300ms) are usually accidental, and sending them to the backend wastes STT API calls.


    Solution: Early validation on the frontend before sending to the backend:






    // Frontend validation before API call
    const recordingDuration = Date.now() - recordingStartTimeRef.current;

    if (recordingDuration < 300) {
      // Too short to contain real speech: respond locally with a clarification
      const clarificationText = getRandomClarification();

      const assistantMessage: Message = {
        role: 'assistant',
        content: '',
        isStreaming: true
      };

      setMessages(prevMessages => [...prevMessages, assistantMessage]);
      // messageIndex: position of the new assistant message (defined elsewhere in the component)
      streamText(clarificationText, messageIndex);
      return; // Don't call STT API
    }

    // Only send to backend if recording is long enough
    const sttResponse = await fetch('/api/stt', {
      method: 'POST',
      body: formData,
    });







    Results:
    • API call reduction: ~15% fewer unnecessary calls
    • User experience: Immediate feedback for accidental recordings
    • Cost savings: Reduced unwanted OpenAI API usage
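
    The duration check can also be pulled out of the component into a pure function, which makes the threshold easy to test and tune. This is a sketch only: decideRecording, RecordingDecision and MIN_RECORDING_MS are hypothetical names, and the 300 ms default is taken from the check above.

```typescript
// Threshold taken from the frontend check in the post (300 ms).
const MIN_RECORDING_MS = 300;

type RecordingDecision =
  | { send: true }
  | { send: false; reason: 'too-short'; durationMs: number };

// Decide whether a recording is long enough to be worth sending to the STT API.
function decideRecording(
  startedAtMs: number,
  stoppedAtMs: number,
  minMs: number = MIN_RECORDING_MS
): RecordingDecision {
  const durationMs = stoppedAtMs - startedAtMs;
  if (durationMs < minMs) {
    return { send: false, reason: 'too-short', durationMs };
  }
  return { send: true };
}
```

    In the component, decideRecording(recordingStartTimeRef.current, Date.now()) would replace the inline comparison, with the clarification message shown whenever send is false.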


    4. Canadian English Optimization

    Problem: Default Whisper models aren't optimized for Canadian English expressions and pronunciation patterns.


    Solution: Custom prompt engineering:






    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });







    Results:
    • Better recognition of Canadian expressions
    • Improved handling of slang and culture-related expressions


    Core Technical Implementation

    1. API Endpoint Architecture

    Our main STT endpoint (/api/stt) follows a robust error-handling pattern:






    export default async function handler(
      req: NextApiRequest,
      res: NextApiResponse<ApiResponse>
    ) {
      if (req.method !== 'POST') {
        return res.status(405).json({ success: false, error: 'Method not allowed' });
      }

      try {
        // Parse form data
        const form = new IncomingForm({ keepExtensions: true });
        const formData: [Fields, Files] = await new Promise((resolve, reject) => {
          form.parse(req, (err, fields, files) => {
            if (err) return reject(err);
            resolve([fields, files]);
          });
        });

        const [fields, files] = formData;

        // Validate audio file
        const audioFiles = files.audio;
        if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
          return res.status(400).json({ success: false, error: 'No audio file provided' });
        }

        const audioFile = audioFiles[0] as File;

        // Process with OpenAI
        const audioPath = audioFile.filepath;
        const audioStream = createReadStream(audioPath);

        const transcription = await openai.audio.transcriptions.create({
          file: audioStream,
          model: "whisper-1",
          language: "en",
          prompt: "This is a conversation in Canadian English.",
        });

        // Cleanup and respond
        fs.unlinkSync(audioPath);
        return res.status(200).json({
          success: true,
          transcript: transcription.text
        });

      } catch (error) {
        console.error('STT Error:', error);
        return res.status(500).json({
          success: false,
          error: error instanceof Error ? error.message : 'Failed to transcribe audio'
        });
      }
    }







    2. File Validation & Security





    // Access the audio file with proper type checking
    const audioFiles = files.audio;
    if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
      return res.status(400).json({
        success: false,
        error: 'No audio file provided'
      });
    }

    const audioFile = audioFiles[0] as File;

    // Additional validation
    if (!audioFile.filepath || audioFile.size === 0) {
      return res.status(400).json({
        success: false,
        error: 'Invalid audio file'
      });
    }







    Security Measures:
    • File type validation (WebM only)
    • Size limits (25MB max)
    • Temporary file cleanup
    • No persistent storage
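
    The snippet above checks only for a missing or empty file, so the WebM and 25MB checks listed here need their own code. One way they might look is sketched below; validateAudioFile, UploadedFileLike and the chosen status codes are assumptions, not the app's actual implementation. The interface mirrors the filepath, size and mimetype fields formidable exposes on an uploaded file. Note that the MIME type is reported by the client, so this is a sanity check rather than proof of the file's contents.

```typescript
// Subset of the fields formidable exposes on an uploaded file.
interface UploadedFileLike {
  filepath: string;
  size: number; // bytes
  mimetype: string | null;
}

const MAX_AUDIO_BYTES = 25 * 1024 * 1024; // 25MB

type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400 | 413 | 415; error: string };

function validateAudioFile(file: UploadedFileLike): ValidationResult {
  if (!file.filepath || file.size === 0) {
    return { ok: false, status: 400, error: 'Invalid audio file' };
  }
  if (file.size > MAX_AUDIO_BYTES) {
    return { ok: false, status: 413, error: 'Audio file exceeds 25MB' };
  }
  // Browsers often report "audio/webm;codecs=opus", so accept codec parameters.
  const mime = (file.mimetype ?? '').toLowerCase();
  if (mime !== 'audio/webm' && !mime.startsWith('audio/webm;')) {
    return { ok: false, status: 415, error: 'Only WebM audio is supported' };
  }
  return { ok: true };
}
```

    In the handler, a failed result could be returned directly with res.status(result.status).json({ success: false, error: result.error }).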


    3. Resource Management





    // Critical: clean up temporary files
    const audioPath = audioFile.filepath;

    try {
      const audioStream = createReadStream(audioPath);

      // Process audio...
    } finally {
      // Always cleanup, even on error: finally runs whether or not processing throws
      try {
        fs.unlinkSync(audioPath);
      } catch (cleanupError) {
        console.warn('Failed to cleanup temp file:', cleanupError);
      }
    }







    Resource Management Benefits:
    • Disk space: Prevents temp file accumulation
    • Security: No persistent audio storage
    • Performance: Clean server state


    Performance Optimizations

    1. Streaming vs Buffering

    Before (Buffering):






    // Load entire file into memory
    const audioBuffer = fs.readFileSync(audioPath);
    const transcription = await openai.audio.transcriptions.create({
      file: audioBuffer, // Large memory usage
      model: "whisper-1",
    });







    After (Streaming):






    // Stream file directly
    const audioStream = createReadStream(audioPath);
    const transcription = await openai.audio.transcriptions.create({
      file: audioStream, // Minimal memory usage
      model: "whisper-1",
    });







    Results:
    • Significantly reduced memory usage through streaming
    • Faster processing for large files
    • Better support for concurrent requests


    Integration with Frontend

    The STT API seamlessly integrates with our frontend conversation system:






    // Frontend STT call
    const sttResponse = await fetch('/api/stt', {
      method: 'POST',
      body: formData,
    });

    const sttData = await sttResponse.json();

    if (!sttData.success) {
      // Handle error gracefully
      const clarificationText = getRandomClarification();
      // Show clarification message to user
    } else {
      // Continue with conversation
      const transcript = sttData.transcript;
      // Send to GPT for response generation
    }
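
    The call above assumes the server always answers with JSON. A slightly more defensive wrapper might look like the sketch below, which also covers network failures and non-JSON error pages. SttResult, FetchLike and requestTranscript are hypothetical names; SttResult is an assumed shape based on the success/transcript/error fields the endpoint returns. The fetch function is passed in so the helper can be exercised without a browser.

```typescript
// Assumed response shape, mirroring the JSON returned by /api/stt.
type SttResult =
  | { success: true; transcript: string }
  | { success: false; error: string };

// Minimal fetch-like signature; the browser's fetch should fit when B is FormData.
type FetchLike<B> = (
  url: string,
  init: { method: string; body: B }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function requestTranscript<B>(
  body: B,
  fetchFn: FetchLike<B>
): Promise<SttResult> {
  try {
    const response = await fetchFn('/api/stt', { method: 'POST', body });
    const data = (await response.json()) as
      | Partial<{ success: boolean; transcript: string; error: string }>
      | null;

    if (response.ok && data?.success === true && typeof data.transcript === 'string') {
      return { success: true, transcript: data.transcript };
    }
    return {
      success: false,
      error:
        typeof data?.error === 'string'
          ? data.error
          : `STT request failed (HTTP ${response.status})`,
    };
  } catch (err) {
    // Network failure or a non-JSON body (for example an HTML error page)
    return {
      success: false,
      error: err instanceof Error ? err.message : 'STT request failed',
    };
  }
}
```

    With this in place the component only has to branch on result.success, and every failure path ends in the same clarification flow.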







    Error Handling & User Experience

    1. Graceful Degradation





    // If STT fails, don't break the conversation
    if (!sttData.success) {
      const clarificationPhrases = [
        "Sorry, can you repeat that?",
        "Could you say that again please?",
        "I didn't quite get that. Could you repeat?",
      ];

      const randomClarification = clarificationPhrases[
        Math.floor(Math.random() * clarificationPhrases.length)
      ];

      // Continue conversation with clarification
    }
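
    The getRandomClarification() helper called in the earlier snippets is not shown in this post. A minimal sketch of what it could look like, reusing the three phrases above, follows. The injectable random parameter is an addition for testability and not part of the original code.

```typescript
const CLARIFICATION_PHRASES = [
  "Sorry, can you repeat that?",
  "Could you say that again please?",
  "I didn't quite get that. Could you repeat?",
] as const;

// Pick one clarification phrase. `random` defaults to Math.random and can be
// replaced with a deterministic function in tests.
function getRandomClarification(random: () => number = Math.random): string {
  const rawIndex = Math.floor(random() * CLARIFICATION_PHRASES.length);
  // Clamp in case a custom random() returns a value outside [0, 1)
  const index = Math.min(Math.max(rawIndex, 0), CLARIFICATION_PHRASES.length - 1);
  return CLARIFICATION_PHRASES[index] ?? CLARIFICATION_PHRASES[0];
}
```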







    2. Debugging & Monitoring





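    // Comprehensive logging for debugging
    // (assumes startTime was captured with Date.now() before the request)
    console.log('STT Response:', {
      success: sttData.success,
      // Log only a short preview, and avoid printing "undefined..." on failure
      transcript: sttData.transcript ? sttData.transcript.substring(0, 50) + '...' : undefined,
      processingTime: Date.now() - startTime,
      fileSize: audioFile.size
    });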







    Production Considerations

    Rate Limiting





    // Implement rate limiting for production
    // (requestCount: requests from this client in the current minute; tracking not shown)
    if (requestCount > 10) { // 10 requests per minute
      return res.status(429).json({
        success: false,
        error: 'You\'re speaking too fast! Please wait a moment before trying again.'
      });
    }
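
    The snippet above does not show how requestCount is tracked. One simple option is an in-memory sliding window keyed by client, sketched below. SlidingWindowRateLimiter is a hypothetical class, not the app's implementation, and because it keeps state in process memory it would only work on a single long-lived server instance; serverless or multi-instance deployments would need a shared store.

```typescript
// In-memory sliding-window limiter: at most `limit` requests per `windowMs` per key.
class SlidingWindowRateLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 10, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if allowed; returns false if over the limit.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const windowStart = nowMs - this.windowMs;
    // Keep only timestamps that are still inside the window
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);

    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }

    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }
}
```

    In the handler, a module-level limiter could be checked with something like limiter.allow(clientKey) before parsing the upload, where clientKey is whatever identifies the caller (an IP address or session ID, for example), returning the 429 response when it yields false.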







    On the frontend:






    if (response.status === 429) {
      showError("Please wait a moment before recording again");
    }







    What's Next

    In the next post, we’ll see how the transcribed text powers our AI conversation system: selecting specific characters, crafting prompts for Canadian English, integrating with GPT-4, and keeping conversations flowing naturally.



