Building an AI Conversation Practice App: Part 2 - Backend Speech-to-Text Processing with OpenAI Whisper

**MyrinNew** · 09-21-2025, 12:00 AM

This is the second post in a series documenting the technical implementation of a browser-based English learning application with real-time speech processing capabilities.

Overview: The STT Pipeline

The complete STT workflow involves:

Audio Reception → FormData parsing with formidable
File Validation → WebM format verification and size checks
Stream Processing → Direct file stream to OpenAI API
Transcription → Whisper-1 model with Canadian English optimization
Response Handling → Error management and cleanup
Integration → Seamless handoff to conversation system

Total processing time: 200-500ms

Technical Stack Summary:

Primary STT: OpenAI Whisper-1
File Processing: Formidable + Node.js streams
Language: TypeScript with Next.js API routes
Error Handling: Basic try-catch with error logging
Performance: Stream processing, Node.js runtime

The Challenges I Solved

1. File Upload Complexity in Next.js

Problem: The built-in body parser in Next.js API routes does not handle multipart/form-data uploads and limits request bodies to 1MB by default.

Solution: Used a custom formidable-based parser:

// Disable Next.js body parsing
export const config = { api: { bodyParser: false } };

// Custom form parsing with formidable
const form = new IncomingForm({
  keepExtensions: true,
});

const formData: [Fields, Files] = await new Promise((resolve, reject) => {
  form.parse(req, (err, fields, files) => {
    if (err) return reject(err);
    resolve([fields, files]);
  });
});

Why this works:

Bypasses the default 1MB body size limit of the Next.js body parser
Handles WebM files up to 25MB
Maintains file metadata and extensions
Provides proper error handling
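
The parser shown above only sets keepExtensions, so the 25MB ceiling would have to be configured explicitly. Here is a minimal sketch of what those options might look like, assuming formidable v2 or later (where `maxFileSize`, `maxFiles`, and `allowEmptyFiles` are available). The names `MAX_AUDIO_BYTES` and `uploadOptions` are mine, not from the project:

```typescript
// Assumed cap: 25MB, matching the limit mentioned in the list above.
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

// Hypothetical options object for formidable's IncomingForm (v2+ option names).
const uploadOptions = {
  keepExtensions: true,         // keep ".webm" on the temp file
  maxFiles: 1,                  // one recording per request
  maxFileSize: MAX_AUDIO_BYTES, // reject larger uploads while parsing
  allowEmptyFiles: false,       // a zero-byte recording is never useful
};
```

The object would be passed as `new IncomingForm(uploadOptions)`. With a limit set, an oversized upload should surface as the `err` argument of `form.parse` rather than reaching the transcription step.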

2. Stream Processing for Large Files

Problem: Loading entire audio files into memory causes performance issues and potential crashes after deployment.

Solution: Direct stream processing to OpenAI API:

// Create readable stream from uploaded file
const audioPath = audioFile.filepath;
const audioStream = createReadStream(audioPath);

// Stream directly to OpenAI (no memory buffering)
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Performance Benefits:

Significantly reduced memory usage through streaming
Faster processing for large files
Better reliability and no memory overflow crashes

3. Frontend Audio Validation

Problem: Short audio recordings (< 300ms) are usually accidental taps, yet each one would still trigger an STT API call.

Solution: Early validation on the frontend before sending to the backend:

// Frontend validation before API call
const recordingDuration = Date.now() - recordingStartTimeRef.current;

if (recordingDuration < 300) {
  const clarificationText = getRandomClarification();

  const assistantMessage: Message = {
    role: 'assistant',
    content: '',
    isStreaming: true
  };

  setMessages(prevMessages => [...prevMessages, assistantMessage]);
  streamText(clarificationText, messageIndex);
  return; // Don't call STT API
}

// Only send to backend if recording is long enough
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

Results:

API call reduction: ~15% fewer unnecessary calls
User experience: Immediate feedback for accidental recordings
Cost savings: Reduced unwanted OpenAI API usage
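
The duration check can also be pulled into a small pure function, which makes the threshold easy to unit test without a real microphone. This is only a sketch: `shouldSendToStt` and `MIN_RECORDING_MS` are hypothetical names, and the timestamps are passed in rather than read from `Date.now()` so tests can control them:

```typescript
// Assumed threshold: the same 300ms cutoff used in the frontend check.
const MIN_RECORDING_MS = 300;

// Returns true when the recording is long enough to be worth an STT call.
// Hypothetical helper; both timestamps are in milliseconds.
function shouldSendToStt(
  startedAtMs: number,
  endedAtMs: number,
  minMs: number = MIN_RECORDING_MS
): boolean {
  const duration = endedAtMs - startedAtMs;
  // A negative duration (for example a missing start time) is treated as too short.
  return duration >= minMs;
}
```

In the component this could be called as `shouldSendToStt(recordingStartTimeRef.current, Date.now())` in place of the inline comparison.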

4. Canadian English Optimization

Problem: Default Whisper models aren't optimized for Canadian English expressions and pronunciation patterns.

Solution: Custom prompt engineering:

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: "whisper-1",
  language: "en",
  prompt: "This is a conversation in Canadian English.",
});

Results:

Better recognition of Canadian expressions
Improved handling of slang and culture-related expressions
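
As far as I understand the API docs, the prompt mainly nudges style and spelling, and only the tail end of a long prompt is considered, so it can help to append a short list of expected words and keep the whole thing compact. Here is a rough sketch of such a builder. The function name, the character budget (the real limit is in tokens, not characters), and the sample vocabulary are all assumptions for illustration:

```typescript
// Assumed budget in characters; a stand-in for the API's token-based limit.
const MAX_PROMPT_CHARS = 600;

// Hypothetical helper: appends de-duplicated vocabulary to a base prompt,
// dropping terms that would push the prompt past the budget.
function buildWhisperPrompt(
  base: string,
  vocabulary: string[],
  maxChars: number = MAX_PROMPT_CHARS
): string {
  const head = base.trim();
  const render = (terms: string[]): string =>
    terms.length === 0 ? head : `${head} Vocabulary: ${terms.join(', ')}.`;

  const kept: string[] = [];
  const seen = new Set<string>();
  for (const raw of vocabulary) {
    const term = raw.trim();
    const key = term.toLowerCase();
    if (term === '' || seen.has(key)) continue; // skip blanks and repeats
    if (render([...kept, term]).length > maxChars) break; // budget reached
    seen.add(key);
    kept.push(term);
  }
  return render(kept);
}
```

For example, `buildWhisperPrompt("This is a conversation in Canadian English.", ["toque", "loonie"])` yields the original prompt followed by `Vocabulary: toque, loonie.`, and the result would go in the `prompt` field of the transcription request.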

Core Technical Implementation

1. API Endpoint Architecture

Our main STT endpoint (/api/stt) follows a robust error-handling pattern:

import type { NextApiRequest, NextApiResponse } from 'next';
import { IncomingForm } from 'formidable';
import type { Fields, Files, File } from 'formidable';
import fs, { createReadStream } from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<ApiResponse>
) {
  if (req.method !== 'POST') {
    return res.status(405).json({ success: false, error: 'Method not allowed' });
  }

  try {
    // Parse form data
    const form = new IncomingForm({ keepExtensions: true });
    const formData: [Fields, Files] = await new Promise((resolve, reject) => {
      form.parse(req, (err, fields, files) => {
        if (err) return reject(err);
        resolve([fields, files]);
      });
    });

    const [fields, files] = formData;

    // Validate audio file
    const audioFiles = files.audio;
    if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
      return res.status(400).json({ success: false, error: 'No audio file provided' });
    }

    const audioFile = audioFiles[0] as File;

    // Process with OpenAI
    const audioPath = audioFile.filepath;
    const audioStream = createReadStream(audioPath);

    const transcription = await openai.audio.transcriptions.create({
      file: audioStream,
      model: "whisper-1",
      language: "en",
      prompt: "This is a conversation in Canadian English.",
    });

    // Cleanup and respond
    fs.unlinkSync(audioPath);
    return res.status(200).json({
      success: true,
      transcript: transcription.text
    });

  } catch (error) {
    console.error('STT Error:', error);
    return res.status(500).json({
      success: false,
      error: error instanceof Error ? error.message : 'Failed to transcribe audio'
    });
  }
}
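
The handler's `ApiResponse` type is not shown in this post. One plausible shape, inferred from the two kinds of JSON the handler returns, is a discriminated union on `success`. The `isApiResponse` guard is a hypothetical helper for checking a parsed body on the client before trusting it:

```typescript
// Assumed response contract, inferred from the handler's res.json(...) calls.
type ApiResponse =
  | { success: true; transcript: string }
  | { success: false; error: string };

// Hypothetical runtime guard: narrows an unknown JSON value to ApiResponse.
function isApiResponse(value: unknown): value is ApiResponse {
  if (typeof value !== 'object' || value === null) return false;
  const body = value as Record<string, unknown>;
  if (body.success === true) return typeof body.transcript === 'string';
  if (body.success === false) return typeof body.error === 'string';
  return false;
}
```

With a union like this, TypeScript only allows `transcript` to be read after checking `success`, which matches how the frontend branches on the response.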

2. File Validation & Security

// Access the audio file with proper type checking
const audioFiles = files.audio;
if (!audioFiles || !Array.isArray(audioFiles) || audioFiles.length === 0) {
  return res.status(400).json({
    success: false,
    error: 'No audio file provided'
  });
}

const audioFile = audioFiles[0] as File;

// Additional validation
if (!audioFile.filepath || audioFile.size === 0) {
  return res.status(400).json({
    success: false,
    error: 'Invalid audio file'
  });
}

Security Measures:

File type validation (WebM only)
Size limits (25MB max)
Temporary file cleanup
No persistent storage
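
The snippet above checks for a missing or empty file, while the WebM and 25MB checks from the list are not shown. Here is a sketch of how they might be folded into one function. `UploadedAudio` is a hypothetical type covering only the fields read here (formidable's `File` has `filepath`, `size`, and `mimetype`), and the error messages are placeholders:

```typescript
// Hypothetical subset of formidable's File that this check needs.
interface UploadedAudio {
  filepath: string;
  size: number;            // bytes
  mimetype: string | null; // e.g. "audio/webm;codecs=opus"
}

// Assumed cap: 25MB, as stated in the security list.
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

type ValidationResult = { ok: true } | { ok: false; error: string };

function validateAudioUpload(file: UploadedAudio): ValidationResult {
  if (!file.filepath || file.size === 0) {
    return { ok: false, error: 'Invalid audio file' };
  }
  if (file.size > MAX_AUDIO_BYTES) {
    return { ok: false, error: 'Audio file exceeds 25MB' };
  }
  // Drop parameters such as ";codecs=opus" before comparing the type.
  const baseType = (file.mimetype ?? '').replace(/;.*$/, '').trim().toLowerCase();
  if (baseType !== 'audio/webm') {
    return { ok: false, error: 'Only WebM audio is supported' };
  }
  return { ok: true };
}
```

Note that the MIME type comes from the client and can be spoofed, so this is a sanity check rather than a guarantee about the file's contents.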

3. Resource Management

// Critical: Clean up temporary files
const audioPath = audioFile.filepath;

try {
  const audioStream = createReadStream(audioPath);

  // Process audio...
} finally {
  // Always cleanup, even on error
  try {
    fs.unlinkSync(audioPath);
  } catch (cleanupError) {
    console.warn('Failed to cleanup temp file:', cleanupError);
  }
}

Resource Management Benefits:

Disk space: Prevents temp file accumulation
Security: No persistent audio storage
Performance: Clean server state

Performance Optimizations

1. Streaming vs Buffering

Before (Buffering):

// Load entire file into memory
const audioBuffer = fs.readFileSync(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioBuffer, // Large memory usage
});

After (Streaming):

// Stream file directly
const audioStream = createReadStream(audioPath);
const transcription = await openai.audio.transcriptions.create({
  file: audioStream, // Minimal memory usage
});

Results:

Significantly reduced memory usage through streaming
Faster processing for large files
Better support for concurrent requests

Integration with Frontend

The STT API seamlessly integrates with our frontend conversation system:

// Frontend STT call
const sttResponse = await fetch('/api/stt', {
  method: 'POST',
  body: formData,
});

const sttData = await sttResponse.json();

if (!sttData.success) {
  // Handle error gracefully
  const clarificationText = getRandomClarification();
  // Show clarification message to user
} else {
  // Continue with conversation
  const transcript = sttData.transcript;
  // Send to GPT for response generation
}

Error Handling & User Experience

1. Graceful Degradation

// If STT fails, don't break the conversation
if (!sttData.success) {
  const clarificationPhrases = [
    "Sorry, can you repeat that?",
    "Could you say that again please?",
    "I didn't quite get that. Could you repeat?",
  ];

  const randomClarification = clarificationPhrases[
    Math.floor(Math.random() * clarificationPhrases.length)
  ];

  // Continue conversation with clarification
}

2. Debugging & Monitoring

// Comprehensive logging for debugging
console.log('STT Response:', {
  success: sttData.success,
  // Log only a short preview; skip it when there is no transcript
  transcript: sttData.transcript ? sttData.transcript.substring(0, 50) + '...' : undefined,
  processingTime: Date.now() - startTime,
  fileSize: audioFile.size
});

Production Considerations

Rate Limiting

// Implement rate limiting for production
if (requestCount > 10) { // 10 requests per minute
  return res.status(429).json({
    success: false,
    error: 'You\'re speaking too fast! Please wait a moment before trying again.'
  });
}

Frontend:

if (response.status === 429) {
  showError("Please wait a moment before recording again");
}
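
The backend snippet leaves open where `requestCount` comes from. One simple way to track it is an in-memory sliding window keyed by client (for example an IP address or session ID). This is a sketch under that assumption: `SlidingWindowLimiter` is a hypothetical class, and because the state lives in one process it would not be shared across serverless instances or multiple servers:

```typescript
// Hypothetical in-memory limiter: at most `limit` requests per `windowMs` per key.
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, number[]>();

  constructor(limit: number = 10, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records the request and returns true if it is allowed, false if over the limit.
  // `nowMs` is injectable so the behaviour can be tested without waiting.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }
}
```

In the handler, a module-level `const limiter = new SlidingWindowLimiter()` could back a check such as `if (!limiter.allow(clientKey)) return res.status(429)...`, where `clientKey` is whatever identifier the app uses per user.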

What's Next

In the next post, we'll see how the transcribed text powers our AI conversation system: selecting specific characters, crafting prompts for Canadian English, integrating with GPT-4, and keeping conversations flowing naturally.
