Clinical Speech ML Pipeline
Mobile speech capture, FastAPI ingestion, and acoustic analysis pipeline.
- Status: Archived · Kong Labs, Purdue
- Role: Undergraduate Research Assistant
- Timeline: Feb–Aug 2025
- Stack: Python · Expo · FastAPI · Whisper · Librosa · FFmpeg · Postgres · asyncpg
01 Problem
Capturing and analysing clinical speech data requires a pipeline that handles mobile audio recording, reliable upload, audio preprocessing, transcription, and structured persistence — without requiring expensive clinic-side infrastructure.
Speech quality varies across capture environments (different devices, rooms, microphones). Normalisation before analysis is necessary to produce comparable acoustic metrics across sessions.
02 Mobile recording flow
An Expo mobile app handles speech capture on iOS and Android. Sessions are recorded as audio files and uploaded to the FastAPI backend via multipart HTTP upload.
The app sends session metadata (participant ID, session type, recording parameters) alongside the audio payload. The backend stages each upload before passing it to the analysis pipeline.
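The staging step can be sketched independently of the web framework. The version below is a minimal illustration, not the project's actual code: the `SessionMetadata` field names, the `stage_upload` helper, and the on-disk layout (one directory per upload with a JSON sidecar) are all assumptions. In a FastAPI handler, the `audio` bytes would typically come from `await file.read()` on an `UploadFile`.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class SessionMetadata:
    # Hypothetical field names; the write-up only lists the kinds of metadata sent.
    participant_id: str
    session_type: str
    sample_rate_hz: int


def stage_upload(audio: bytes, filename: str, meta: SessionMetadata,
                 staging_dir: Path) -> Path:
    """Write the raw audio plus a JSON metadata sidecar; return the audio path."""
    if not audio:
        raise ValueError("empty audio payload")
    # A fresh ID per upload keeps concurrent uploads from colliding.
    target_dir = staging_dir / uuid.uuid4().hex
    target_dir.mkdir(parents=True)
    # Keep only the extension of the client-supplied name to avoid path traversal.
    suffix = Path(filename).suffix.lower() or ".bin"
    audio_path = target_dir / f"original{suffix}"
    audio_path.write_bytes(audio)
    (target_dir / "metadata.json").write_text(json.dumps(asdict(meta)))
    return audio_path
```

Keeping the original file untouched in staging means normalisation can be re-run later with different settings without asking the participant to record again.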
03 Upload pipeline and normalisation
The FastAPI backend receives multipart uploads and stages audio files for processing. FFmpeg normalises each file — adjusting loudness, sample rate, and channel format — before it enters the analysis pipeline.
Consistent normalisation is what makes Librosa metrics comparable across sessions. Without it, variation in the capture environment (device, room acoustics, distance from the microphone) dominates the acoustic signal.
On the persistence side, asyncpg provides connection pooling so that concurrent sessions can write to Postgres without each request opening its own connection.
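As a rough sketch of the normalisation step, the function below builds an FFmpeg command that applies the `loudnorm` filter and fixes sample rate and channel count. The target values (-23 LUFS, 16 kHz, mono) and the helper names are assumptions for illustration; the write-up does not state which settings the project used.

```python
import subprocess
from pathlib import Path


def build_normalise_cmd(src: Path, dst: Path, *, lufs: float = -23.0,
                        sample_rate: int = 16_000, channels: int = 1) -> list[str]:
    """Return the FFmpeg argument list; the defaults are illustrative assumptions."""
    return [
        "ffmpeg", "-y",                            # overwrite dst if it exists
        "-i", str(src),
        "-af", f"loudnorm=I={lufs}:TP=-2:LRA=7",   # loudness normalisation
        "-ar", str(sample_rate),                   # output sample rate in Hz
        "-ac", str(channels),                      # output channel count
        str(dst),
    ]


def normalise(src: Path, dst: Path) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_normalise_cmd(src, dst), check=True, capture_output=True)
```

Separating command construction from execution lets the exact arguments be logged and unit-tested without FFmpeg installed, which helps when the goal is reproducible preprocessing.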
04 Transcription and acoustic analysis
Whisper handles transcription — producing a text transcript with word-level timestamps. Librosa computes acoustic metrics from the normalised audio: speaking rate, pitch statistics, pause distribution, and articulation rate.
These metrics are stored per-session in Postgres alongside the transcript. The combination supports downstream analysis of speech patterns across sessions and participants.
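One plausible way to derive pause and rate metrics is sketched below. It is an assumption, not the project's confirmed method: pauses are taken as runs of low-energy frames between stretches of speech, and rates are counted in words per second using the transcript's word count (syllables per second is another common convention). The frame energies could come from something like `librosa.feature.rms`, with `hop_s` equal to the hop length divided by the sample rate. The threshold and the 0.25 s minimum pause are illustrative values.

```python
def pause_durations(rms: list[float], hop_s: float, threshold: float,
                    min_pause_s: float = 0.25) -> list[float]:
    """Durations (s) of silent runs that sit between voiced frames."""
    pauses: list[float] = []
    run = 0               # length of the current silent run, in frames
    seen_speech = False   # leading silence is not counted as a pause
    for value in rms:
        if value < threshold:
            run += 1
            continue
        # A voiced frame closes any silent run that preceded it.
        if seen_speech and run * hop_s >= min_pause_s:
            pauses.append(run * hop_s)
        seen_speech = True
        run = 0
    # A silent run still open at the end is trailing silence and is dropped.
    return pauses


def rates(word_count: int, duration_s: float, pauses: list[float]) -> dict[str, float]:
    """Speaking rate includes pauses; articulation rate excludes them."""
    speech_s = duration_s - sum(pauses)
    if duration_s <= 0 or speech_s <= 0:
        raise ValueError("duration must be positive and exceed total pause time")
    return {
        "speaking_rate_wps": word_count / duration_s,
        "articulation_rate_wps": word_count / speech_s,
    }
```

Because the energy threshold is absolute, this kind of heuristic only gives comparable results when recordings have been loudness-normalised first.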
05 Persistence and analytics
Postgres stores session metadata, upload records, transcripts, and per-metric acoustic scores. An analytics endpoint aggregates metrics across sessions for a given participant — producing summary statistics for review.
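The per-participant aggregation could look roughly like the sketch below. The row shape (`metric` and `value` keys) and the choice of summary statistics are assumptions; the same summary could equally be computed in Postgres with a `GROUP BY` over the metric name.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise(rows: list[dict]) -> dict[str, dict]:
    """Summarise one participant's rows, e.g. {"metric": "speaking_rate", "value": 2.1}."""
    by_metric: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_metric[row["metric"]].append(row["value"])
    return {
        name: {
            "n": len(values),
            "mean": mean(values),
            # Sample standard deviation is undefined for a single session.
            "stdev": stdev(values) if len(values) > 1 else None,
            "min": min(values),
            "max": max(values),
        }
        for name, values in by_metric.items()
    }
```

Reporting `n` alongside each statistic makes it obvious when a summary rests on only one or two sessions.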
Health check endpoints cover database connectivity and pipeline service availability.
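A health check of this kind can be sketched as a set of named async probes run concurrently with a timeout. The `run_checks` helper, the response shape, and the probe names are hypothetical; in the real service the database probe would presumably run a trivial query through the asyncpg pool.

```python
import asyncio
from typing import Awaitable, Callable

# A probe is any zero-argument coroutine function that raises on failure.
Check = Callable[[], Awaitable[None]]


async def run_checks(checks: dict[str, Check], timeout_s: float = 2.0) -> dict:
    """Run every probe concurrently and report a per-probe status."""

    async def run_one(name: str, check: Check) -> tuple[str, str]:
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return name, "ok"
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:  # report the failure rather than crash the endpoint
            return name, f"error: {type(exc).__name__}"

    pairs = await asyncio.gather(*(run_one(n, c) for n, c in checks.items()))
    results = dict(pairs)
    healthy = all(status == "ok" for status in results.values())
    return {"status": "ok" if healthy else "degraded", "checks": results}
```

The timeout matters because a hung database connection would otherwise make the health endpoint itself hang, which defeats its purpose.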
06 Limitations
No authentication. The API does not implement participant or researcher authentication. A production clinical deployment requires authentication and role-based access control.
No privacy or compliance guarantees. The pipeline does not implement HIPAA-compliant data handling, consent management, or data minimisation. It is a research prototype, not a clinical tool.
Heuristic acoustic metrics. Speaking rate and articulation rate are computed with heuristics built on Librosa features. They are research-grade approximations, not clinically validated measures.
Prototype-stage evaluation. The pipeline has not been validated for reliability at scale or against clinical ground truth. Performance claims from other research contexts should not be attributed to this implementation.
07 What I learned
Preprocessing reproducibility matters more than analysis sophistication. Consistent FFmpeg normalisation was the step that made Librosa metrics meaningful — without it, capture environment variation dominated the signal.
Separating upload, normalisation, and analysis stages made each independently testable and debuggable. A monolithic pipeline would have obscured which stage produced unexpected results.