DeepSpeech is an open-source embedded (offline, on-device) speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
This is the 0.9.1 release of DeepSpeech, an open speech-to-text engine. In accordance with semantic versioning, this version is not completely backwards compatible with earlier versions; however, models exported for 0.7.X and 0.8.X should work with this release. This is a bugfix release that retains compatibility with the 0.9.0 models; all model files included here are identical to the ones in the 0.9.0 release. As with previous releases, this release includes the source code and the trained models:
The model files with the ".pbmm" extension are memory-mapped, making them memory efficient and fast to load. The model files with the ".tflite" extension are converted to TensorFlow Lite, have post-training quantization enabled, and are more suitable for resource-constrained environments.
The acoustic models were trained on American English with synthetic noise augmentation, and the .pbmm model achieves a 7.06% word error rate on the LibriSpeech clean test corpus.
Note that the model currently performs best in low-noise environments with clear recordings and has a bias towards US male accents. This does not mean the model cannot be used outside of these conditions, but accuracy may be lower. Some users may need to train the model further to meet their intended use case.
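Word error rate, as quoted above, is the word-level edit distance between the hypothesis transcript and the reference, divided by the number of reference words. A minimal, self-contained sketch of the metric (not the project's own evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words; updated row by row.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            if ref[i - 1] == hyp[j - 1]:
                dp[j] = prev  # exact word match: no edit
            else:
                # substitution, deletion, or insertion, whichever is cheapest
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference yields a WER of 1/3.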
The checkpoint files used to produce these models are also included; they are under the MPL-2.0 license and can be used as the basis for further fine-tuning.
Notable changes from the previous release
Fixed problem with documentation build on ReadTheDocs.org (#3399)
Training Regimen + Hyperparameters for fine-tuning
The hyperparameters used to train the model are useful for fine-tuning, so we document them here along with the training regimen, the hardware used (a server with 8 Quadro RTX 6000 GPUs, each with 24GB of VRAM), and our use of cuDNN RNN.
In contrast to some previous releases, training for this release was performed as a fine-tuning of the previous 0.8.2 checkpoint, with data augmentation options enabled. The following hyperparameters were used for the fine-tuning; see the 0.8.2 release notes for the hyperparameters used for the base model.
--augment overlay[p=0.9,source=${noise},layers=1,snr=12~4] (where ${noise} is a dataset of Freesound.org background noise recordings)
--augment overlay[p=0.1,source=${voices},layers=10~2,snr=12~4] (where ${voices} is a dataset of audiobook snippets extracted from Librivox)
--augment resample[p=0.2,rate=12000~4000]
--augment codec[p=0.2,bitrate=32000~16000]
--augment reverb[p=0.2,decay=0.7~0.15,delay=10~8]
--augment volume[p=0.2,dbfs=-10~10]
--cache_for_epochs 10
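In the flags above, values written as center~radius (for example snr=12~4) denote a range that is re-sampled for each training sample; as we understand the augmentation documentation, 12~4 means a value drawn uniformly from [12 - 4, 12 + 4]. A hedged sketch of that sampling, using a hypothetical parse_range helper (not the training code's actual parser):

```python
import random

def parse_range(spec: str) -> tuple[float, float]:
    """Hypothetical parser for a 'center~radius' range spec like '12~4'."""
    if "~" in spec:
        center, radius = (float(x) for x in spec.split("~"))
    else:
        center, radius = float(spec), 0.0  # a bare value is a fixed point
    return center - radius, center + radius

def sample(spec: str) -> float:
    """Draw one value uniformly from the range described by spec."""
    lo, hi = parse_range(spec)
    return random.uniform(lo, hi)
```

So snr=12~4 yields signal-to-noise ratios between 8 and 16 dB across training samples.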
The weights with the best validation loss were selected at the end of 200 epochs using --noearly_stop.
The optimal lm_alpha and lm_beta values with respect to the LibriSpeech clean dev corpus remain unchanged from the previous release:
lm_alpha 0.931289039105002
lm_beta 1.1834137581510284
For the Mandarin Chinese model, the following values are recommended:
lm_alpha 0.6940122363709647
lm_beta 4.777924224113021
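Roughly speaking, lm_alpha and lm_beta weight the external scorer during beam search decoding: each candidate transcript is ranked by its acoustic log-probability, plus lm_alpha times the language-model log-probability, plus lm_beta times the word count. A simplified illustration of that weighting (the real decoder in ds_ctcdecoder is considerably more involved):

```python
def scorer_weighted_score(acoustic_logp: float, lm_logp: float,
                          word_count: int,
                          lm_alpha: float, lm_beta: float) -> float:
    """Combined score used to rank beam search candidates (simplified)."""
    return acoustic_logp + lm_alpha * lm_logp + lm_beta * word_count

# A higher lm_alpha leans more heavily on the language model; a higher
# lm_beta offsets the LM's inherent preference for shorter transcripts.
```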
Bindings
This release also includes a Python-based command line tool, deepspeech, installed through
pip install deepspeech
Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:
pip install deepspeech-gpu
On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:
pip install deepspeech-tflite
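Once one of these packages is installed, transcription from the command line looks like the following. The model and scorer file names match this release's assets, but the audio path is illustrative; the input should be a 16 kHz mono WAV file:

```shell
# Transcribe a 16 kHz mono WAV file with the released model and scorer
deepspeech --model deepspeech-0.9.1-models.pbmm \
           --scorer deepspeech-0.9.1-models.scorer \
           --audio my_recording.wav
```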
DeepSpeech also exposes bindings for the following languages:
Python (Versions 3.5, 3.6, 3.7 and 3.8) installed via
pip install deepspeech
Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:
pip install deepspeech-gpu
On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:
pip install deepspeech-tflite
NodeJS (Versions 10.x, 11.x, 12.x, 13.x, 14.x and 15.x) installed via
npm install deepspeech
Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:
npm install deepspeech-gpu
On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:
npm install deepspeech-tflite
ElectronJS versions 5.0, 6.0, 6.1, 7.0, 7.1, 8.0, 9.0, 9.1, 9.2, 10.0 and 10.1 are also supported
C, which requires that the appropriate shared objects be installed from native_client.tar.xz (see the section in the main README that describes native_client.tar.xz installation).
FAQ - We have a list of common questions, and their answers, in our FAQ. When just getting started, it's best to check the FAQ first to see if your question is addressed.
Matrix - If your question is not addressed by either the FAQ or the Discourse Forums, you can contact us on the #machinelearning:mozilla.org channel on Mozilla Matrix; people there can try to answer your question and help.
Issues - Finally, if all else fails, you can open an issue in our repo if there is a bug in the current code base.