This is a template of the S2T1 recipe for ESPnet2. It is based on ASR1, but follows the style of OpenAI’s Whisper to train a single encoder-decoder model for various speech processing tasks. Specifically, it uses special tokens as task specifiers (e.g., transcribe, translate) or prediction targets (e.g., language ID) so that a single model can perform multiple tasks for multiple languages. It further supports conditional generation, where the condition is the previous sentence within a long talk.

More details can be found in our OWSM paper (ASRU 2023).

The S2T1 recipe consists of 16 stages.

Data preparation stage. It calls local/data.sh to create Kaldi-style data directories in data/ for the training and validation sets.

The training data has the following format:

where a dedicated special token denotes the start of the previous/prompt sentence. The timestamps are also treated as special tokens because the audio has a fixed length (30s) and a fixed resolution (20ms or 40ms). An example looks like:

During data preparation, three text files should be generated:

Further notes:

Augment training data with speed perturbation. data/${train_set}_spXX will be generated (XX is the speed factor). This step is optional. Note that the timestamps need to be changed as well.

Format the wave files in wav.scp to a single format (wav / flac / kaldi_ark).

Remove data that is too long or too short.

Generate a token list from the training data. BPE tokens are used.

A neural-network (NN) based language model (LM) is optional for the S2T1 task. You can skip stages 6-9 by setting --use_lm false.

Statistics calculation stage. It collects the shape information of LM texts and calculates statistics for LM training.

NN-based LM model training stage. You can change the training setting via the --lm_config and --lm_args options.

NN-based LM evaluation stage.
Perplexity (PPL) is computed against the trained model.

N-gram-based LM model training stage.

Statistics calculation stage. It collects the shape information of input and output texts for S2T training.

S2T model training stage. You can change the training setting via the --s2t_config and --s2t_args options.

S2T inference stage. We can perform ASR or ST using any prepared test data.

Calculate ASR error rates (char / word / token).

Packing stage. It packs the trained model files and uploads them to Hugging Face.

We have created several recipes for OWSM training. Please check egs2/mixed_v1, egs2/mixed_v2, egs2/mixed_v3 for more information.

Pre-trained OWSM can be fine-tuned on a specific dataset. Here, we use AISHELL-1 as an example.

We use this s2t1 template to fine-tune OWSM, so we first create this directory under our custom dataset egs2/aishell.

Then, we download a pre-trained model, e.g., espnet/owsm_v2_ebranchformer, using the following command:

The downloaded model will be saved in the local cache and then uncompressed. An exp directory will be created automatically; it contains symbolic links to the checkpoint and config files.

To use a pre-trained model, we need the following important files:

The path to stats can be found in config.yaml, e.g.:

The path to bpemodel can also be found in config.yaml, e.g.:

In the following sections, we will manually copy those two files to the correct places.

The data should be prepared in the OWSM format. Please refer to the data preparation stage above for more information.

Since AISHELL-1 has been included in OWSM v1, we can reuse those preparation scripts. For your own data, please write the scripts yourself and make sure that the special tokens, such as the language codes, are consistent with the pre-trained model. Note that we will NOT generate a new vocabulary for fine-tuning. Instead, we will use the vocabulary from the pre-trained model.

The prepared data will be stored in a new directory data.

Next, we execute various stages in s2t.sh.
To make it easier, we create a run.sh shown below. It is mostly copied from the OWSM v2 recipe.

We run Stage 3 and Stage 4 to format data:

We create the BPE token directory by ourselves. This is equivalent to Stage 5 but we do not generate a new token list.

We create a training config file for fine-tuning. It is modified from the original config in config.yaml. Note that you may need to tune the training hyperparameters such as learning rate. The model might easily overfit to a small training set.

We need to collect the shapes of speech and text, but skip the mean and variance collection because we use a pre-trained model. We run Stage 10 with a smaller batch size:

Then, we copy the existing mean and variance to the correct place:

Now, we can start training:
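
A minimal sketch of the training invocation, under two assumptions that are not stated in the text: that S2T training is Stage 11 (the stage immediately after the Stage 10 statistics collection), and that run.sh forwards the standard --stage and --stop_stage options to s2t.sh. The guard around the call is only there so that the snippet does not fail when pasted outside the recipe directory.

```shell
set -eu

# Assumption: S2T model training is Stage 11, i.e. the stage that follows
# the Stage 10 statistics collection used above.
train_stage=11

if [ -x ./run.sh ]; then
    # Assumption: run.sh passes these options through to s2t.sh.
    ./run.sh --stage "${train_stage}" --stop_stage "${train_stage}"
else
    # Outside the recipe directory, only print the command that would be run.
    echo "run.sh not found; would run: ./run.sh --stage ${train_stage} --stop_stage ${train_stage}"
fi
```

Setting both options to the same value runs only that one stage, so the earlier formatting and statistics steps are not repeated.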