Speech recognition (ASR)
Convolution-augmented Transformer architecture for speech recognition that combines attention and convolution.
Mozilla open-source speech-to-text engine based on the Deep Speech architecture.
Meta HuBERT self-supervised speech representation learning model for automatic speech recognition.
Meta wav2vec 2.0 self-supervised speech representation model for ASR pretraining and fine-tuning.
OpenAI Whisper open-source automatic speech recognition and speech translation model.
Speech synthesis (TTS)
Suno text-prompted generative audio model for speech, music, background noise, and simple sound effects.
Deep learning toolkit for training and running text-to-speech, voice cloning, and speech synthesis models.
PyTorch implementation of FastSpeech 2 for fast, non-autoregressive neural text-to-speech synthesis.
Compact 82M-parameter text-to-speech model distributed on Hugging Face for efficient speech synthesis.
Fast, local neural text-to-speech system with lightweight voices for offline speech synthesis.
Official StyleTTS 2 implementation for human-level text-to-speech using style diffusion and adversarial training.
NVIDIA Tacotron 2 implementation for neural text-to-speech synthesis with spectrogram prediction.
Official VITS implementation for end-to-end text-to-speech with variational inference and adversarial learning.
Coqui XTTS v2 multilingual voice generation model for cloning voices across languages from short audio prompts.
Toolkits & services
Deepgram provides speech-to-text, text-to-speech, and audio intelligence APIs for developers.
Open-source speech recognition toolkit for training and decoding ASR systems.
NVIDIA NeMo framework for building, training, and deploying conversational AI and generative AI models.
Open-source Python toolkit and pretrained pipelines for speaker diarization and related speech processing tasks.
Open-source PyTorch toolkit for speech recognition, speaker recognition, separation, and enhancement.