Efficient Minimum Word Error Rate Training for Attention-Based Models

The RNN-T model is typically trained to optimize a cross-entropy criterion, which corresponds to maximizing the log-likelihood of the training data. However, the model is usually evaluated with a different metric, namely the Word Error Rate (WER).
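For reference, the standard RNN-T objective maximizes the log-likelihood of the reference transcript $\mathbf{y}^*$ given the acoustics $\mathbf{x}$, marginalized over all blank-augmented alignments, while WER counts word-level edit operations:

$$
\mathcal{L}_{\text{RNN-T}} = -\log P(\mathbf{y}^* \mid \mathbf{x})
= -\log \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y}^*)} P(\mathbf{a} \mid \mathbf{x}),
\qquad
\mathrm{WER} = \frac{S + D + I}{N},
$$

where $\mathcal{B}$ removes blanks from an alignment $\mathbf{a}$, and $S$, $D$, $I$ count the substitutions, deletions, and insertions against a reference of $N$ words. Since WER is a discrete edit distance, it cannot be optimized by gradient descent directly, which motivates the expected-WER surrogate used in MWER training (sketched below).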

In this work, we present a TensorFlow implementation of an alternative training procedure that allows for direct optimization of the latter metric. Moreover, we incorporate into the solution a modified version of the RNN-T loss, called the Monotonic RNN-T loss, which is used to regularize the new training procedure. The Monotonic RNN-T loss, as opposed to the standard RNN-T loss, enforces strictly monotonic alignments between the input and output sequences, which allows for faster decoding, a property crucial for Minimum Word Error Rate (MWER) training.
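As an illustration only (the function names, tensor layout, and baseline choice below are assumptions for exposition, not this repository's actual API), an MWER loss over an N-best list can be sketched in TensorFlow roughly as follows: hypothesis probabilities are renormalized over the N-best list, and the expected number of word errors, with the N-best mean subtracted as a variance-reducing baseline, is minimized.

```python
# Illustrative sketch only -- names, shapes, and the baseline choice are
# assumptions for exposition, not the repository's actual implementation.
import tensorflow as tf


def mwer_loss(hyp_log_probs: tf.Tensor, word_errors: tf.Tensor) -> tf.Tensor:
    """Expected word errors over an N-best list.

    Args:
        hyp_log_probs: total log-probability of each hypothesis, [batch, n_best].
        word_errors: edit distance of each hypothesis against the reference,
            [batch, n_best].
    """
    word_errors = tf.cast(word_errors, hyp_log_probs.dtype)
    # Renormalize over the N-best list so the hypothesis scores form a proper
    # distribution: P_hat(y_i | x) = softmax over the n_best axis.
    hyp_probs = tf.nn.softmax(hyp_log_probs, axis=-1)
    # Subtracting the mean error count over the N-best list is a standard
    # baseline that reduces gradient variance without moving the optimum.
    relative_errors = word_errors - tf.reduce_mean(
        word_errors, axis=-1, keepdims=True)
    # Expected (baseline-subtracted) word errors, averaged over the batch.
    return tf.reduce_mean(tf.reduce_sum(hyp_probs * relative_errors, axis=-1))
```

In a setup like this, the Monotonic RNN-T loss described above would enter as a regularizer, e.g. `loss = mwer_loss(...) + lam * monotonic_rnnt_loss(...)`; the interpolation weight `lam` is a hypothetical knob here, not a documented default of this repository.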

- Paper link
- Git repository