Whisper is a large-scale, weakly supervised ASR model from OpenAI. Its encoder is also widely used for speech tokenization.
Data Processing
- Construct the dataset from audio paired with transcripts found on the Internet
- Filter out machine-generated data, since data generated by other ASR systems can significantly impair performance
- Use an audio language detector to check that the spoken language matches the transcript language
- Break audio files into 30-second segments, each paired with the portion of the transcript that falls within that time window (see the sketch after this list)
- Also train on segments without speech (used as voice activity detection training data), but sample them at a reduced rate
- De-duplicate at the transcript level between the training and evaluation datasets
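To make the segmentation step concrete, here is a minimal sketch, assuming word-level timestamps are available for each transcript. The `Word` dataclass and the `segment_transcript` helper are hypothetical illustrations, not the paper's actual pipeline.

```python
from dataclasses import dataclass

SEGMENT_SECONDS = 30.0

@dataclass
class Word:
    text: str
    start: float  # word start time in seconds
    end: float    # word end time in seconds

def segment_transcript(words: list[Word], audio_seconds: float) -> list[tuple[float, str]]:
    """Return (segment_start_time, transcript_text) pairs for consecutive 30 s windows."""
    segments = []
    t = 0.0
    while t < audio_seconds:
        in_window = [w.text for w in words if t <= w.start < t + SEGMENT_SECONDS]
        # Windows containing no speech are still kept (sampled at a reduced rate)
        # and serve as voice-activity-detection training examples.
        segments.append((t, " ".join(in_window)))
        t += SEGMENT_SECONDS
    return segments
```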
Model
The model architecture is an encoder-decoder Transformer. Note that this differs from LLM training, where most models are decoder-only. The reason is that for the ASR task, the whole audio segment is available before transcription begins, so the encoder can attend over it bidirectionally.
The data flow is as follows (a minimal sketch of the encoder front-end appears after the list):
- extract log-Mel filterbank (fbank) features using a window length of 25 ms and a stride of 10 ms;
- pass the fbank features through two convolutional layers (the second convolution uses a stride of 2 for 2x temporal downsampling, shortening the sequence the encoder must attend over) and add positional encoding;
- pass them through a standard Transformer encoder to perform self-attention and obtain the audio's encoder hidden states;
- the decoder then autoregressively generates text tokens, cross-attending to those hidden states.
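As a concrete illustration of the first three steps, below is a minimal PyTorch sketch of the convolutional front-end. The class name, the learned (rather than sinusoidal) positional embedding, and the dimensions (n_mels=80, d_model=512, 3000 input frames for a 30 s segment) are illustrative assumptions, not the exact Whisper implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoderFrontEnd(nn.Module):
    """Conv front-end + positional encoding, applied before the Transformer encoder blocks."""

    def __init__(self, n_mels: int = 80, d_model: int = 512, max_frames: int = 3000):
        super().__init__()
        # Two 1-D convolutions over time; the second uses stride 2, so
        # 3000 fbank frames (30 s at a 10 ms stride) become 1500 positions.
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        # Positional encoding added after downsampling (learned here for brevity).
        self.pos_emb = nn.Parameter(torch.randn(max_frames // 2, d_model) * 0.01)

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:
        # fbank: (batch, n_mels, time), log-Mel features from 25 ms windows / 10 ms stride
        x = F.gelu(self.conv1(fbank))
        x = F.gelu(self.conv2(x))        # 2x temporal downsampling
        x = x.permute(0, 2, 1)           # (batch, time // 2, d_model)
        return x + self.pos_emb[: x.shape[1]]

# The output would then pass through standard Transformer encoder blocks
# (self-attention) to produce the hidden states the decoder cross-attends to.
x = AudioEncoderFrontEnd()(torch.randn(1, 80, 3000))
print(x.shape)  # torch.Size([1, 1500, 512])
```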
Training takes multiple tasks into consideration, such as transcription, translation, voice activity detection, alignment, and language identification.
To handle this multitask setup, the output is designed as a unified generation format. For instance, the output is conditioned on the transcript history (the transcript text preceding the current audio segment). Generation includes (see the sketch after this list):
- a language ID token
- a <|nospeech|> token for segments without speech
- a <|transcribe|> or <|translate|> token, followed by text generation (next-token prediction)
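Here is a small sketch of how such a unified token sequence might be assembled. The special tokens <|nospeech|>, <|transcribe|>, and <|translate|> come from the list above; <|startofprev|>, <|startoftranscript|>, <|notimestamps|>, <|endoftext|>, the exact ordering, and the `build_decoder_tokens` helper itself are assumptions for illustration, not the paper's exact format.

```python
def build_decoder_tokens(language: str, task: str, text: str,
                         prev_text: str | None = None,
                         no_speech: bool = False) -> list[str]:
    tokens: list[str] = []
    if prev_text:
        # Condition on the transcript text preceding the current audio segment.
        tokens += ["<|startofprev|>", prev_text]
    tokens.append("<|startoftranscript|>")
    if no_speech:
        # Voice-activity-detection case: no language/task tokens and no text.
        return tokens + ["<|nospeech|>", "<|endoftext|>"]
    tokens.append(f"<|{language}|>")   # language ID, e.g. <|en|>
    tokens.append(f"<|{task}|>")       # <|transcribe|> or <|translate|>
    tokens.append("<|notimestamps|>")  # timestamp tokens omitted for brevity
    return tokens + [text, "<|endoftext|>"]

print(build_decoder_tokens("en", "transcribe", "hello world",
                           prev_text="previous segment text"))
```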
References
- Robust Speech Recognition via Large-Scale Weak Supervision
...