Whisper is a large-scale ASR model from OpenAI trained with weak supervision. Its encoder is widely used as a feature extractor for speech tokenization.
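As a quick orientation, here is a short usage sketch with the openai-whisper package; `load_model` and `transcribe` are real entry points of that package, while the audio file name is just a placeholder.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
# transcribe() internally chunks the audio into 30-second segments and decodes each one.
result = model.transcribe("example.wav")  # "example.wav" is a placeholder path
print(result["language"], result["text"])
```
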
Data Processing
- Construct the dataset from audio paired with transcripts found on the Internet
- Filter out machine-generated transcripts, since training on data produced by other ASR systems has been shown to significantly impair the performance of translation systems
- Use an audio language detector to check that the spoken language matches the transcript language
- Break audio files into 30-second segments, each paired with the portion of the transcript that falls within that time window (see the sketch after this list)
- Also train on segments that contain no speech, but sample them at a reduced rate
- De-duplicate at the transcript level between the training and evaluation datasets
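A minimal sketch of the 30-second segmentation step, assuming a hypothetical transcript format of (start, end, text) tuples with timestamps in seconds; the function and variable names are illustrative, not Whisper's actual data pipeline.

```python
# Illustrative sketch: pair 30-second audio windows with transcript text.
# Assumes `audio` is a 1-D array sampled at 16 kHz and `transcript` is a list
# of (start_sec, end_sec, text) tuples -- both formats are assumptions here.
SAMPLE_RATE = 16_000
SEGMENT_SECONDS = 30

def segment_audio(audio, transcript):
    """Yield (audio_chunk, text) pairs for consecutive 30-second windows."""
    samples_per_segment = SAMPLE_RATE * SEGMENT_SECONDS
    for offset in range(0, len(audio), samples_per_segment):
        chunk = audio[offset:offset + samples_per_segment]
        win_start = offset / SAMPLE_RATE
        win_end = win_start + SEGMENT_SECONDS
        # Collect transcript lines whose span lies inside the current window.
        text = " ".join(t for s, e, t in transcript if s >= win_start and e <= win_end)
        # Windows with no transcript text can still be kept (at a reduced
        # sampling rate) as voice-activity-detection training data.
        yield chunk, text
```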
 
Model
The model architecture is an encoder-decoder Transformer. Note that this differs from LLM training, where most models are decoder-only. The reason is that for the ASR task the whole audio segment is available before transcription begins, so the encoder can attend bidirectionally over the full input.
The data flow is as follows (see the sketch after this list):
- extract fbank (log-Mel) features using a window length of 25 ms and a stride of 10 ms;
- pass the fbank features through two convolutional layers (to downsample the sequence, the second convolution uses a stride of 2 for 2x downsampling) and add positional encoding;
- pass the result through a standard Transformer encoder to perform self-attention and obtain the audio's encoder hidden states;
- the decoder then generates the output autoregressively, attending to the encoder hidden states through cross-attention.
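A short sketch of this front-end using the openai-whisper package; `load_audio`, `pad_or_trim`, and `log_mel_spectrogram` are functions from that package, while reaching into `model.encoder` directly is an assumption about its internals rather than a documented entry point.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

# Load audio at 16 kHz and pad/trim to a single 30-second segment.
audio = whisper.load_audio("example.wav")  # placeholder path
audio = whisper.pad_or_trim(audio)

# 80-channel log-Mel spectrogram with 25 ms windows and a 10 ms stride,
# i.e. 3000 frames for a 30-second segment.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The encoder's conv stem downsamples 2x in time, so the hidden states
# have 1500 positions, each of dimension d_model.
encoder_hidden = model.encoder(mel.unsqueeze(0))
print(encoder_hidden.shape)  # e.g. torch.Size([1, 1500, 512]) for the base model
```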
 
Training jointly covers multiple tasks, such as transcription, translation (into English), voice activity detection, alignment (timestamp prediction), and language identification.

To handle multitasking, the output is designed as a unified generation format. For instance, generation can be conditioned on the history of the transcript (the transcript text preceding the current audio segment). The generated sequence includes (see the sketch after this list):
- a language ID token
- a <|nospeech|> token for segments with no speech
- a <|transcribe|> or <|translate|> token, followed by the text generated via next-token prediction
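A minimal sketch of how such a decoder prompt might be assembled from the special tokens described in the paper; the helper below is hand-rolled for illustration and is not Whisper's actual tokenizer code (which lives in whisper.tokenizer).

```python
def build_prompt(language="en", task="transcribe", prev_text_tokens=None,
                 no_speech=False, timestamps=False):
    """Assemble the special-token prefix of Whisper's multitask format (illustrative)."""
    tokens = []
    if prev_text_tokens:
        # Condition on the transcript text preceding the current 30-second segment.
        tokens += ["<|startofprev|>"] + list(prev_text_tokens)
    tokens.append("<|startoftranscript|>")
    if no_speech:
        # Segments without speech predict <|nospeech|> and stop.
        return tokens + ["<|nospeech|>"]
    tokens.append(f"<|{language}|>")   # language ID token
    tokens.append(f"<|{task}|>")       # <|transcribe|> or <|translate|>
    if not timestamps:
        tokens.append("<|notimestamps|>")
    # The model then continues with text (and optional timestamp) tokens,
    # ending with <|endoftext|>.
    return tokens

print(build_prompt(language="zh", task="translate"))
```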
 
References
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision", 2022 (the Whisper paper)
 
...