Speech | MLNotebooks Tools

Speech recognition (ASR)

Conformer
Convolution-augmented Transformer architecture for speech recognition that combines self-attention with convolution.
DeepSpeech
Mozilla's open-source speech-to-text engine, based on Baidu's Deep Speech architecture.
HuBERT
Meta's self-supervised speech representation learning model for automatic speech recognition.
wav2vec 2.0
Meta's self-supervised speech representation model for ASR pretraining and fine-tuning.
Whisper
OpenAI's open-source model for automatic speech recognition and speech translation.

Text to speech (TTS)

Bark
Suno's text-prompted generative audio model for speech, music, background noise, and simple sound effects.
Coqui TTS
Deep learning toolkit for training and running text-to-speech, voice cloning, and speech synthesis models.
FastSpeech 2
PyTorch implementation of FastSpeech 2 for fast, non-autoregressive neural text-to-speech synthesis.
Kokoro
Compact 82M-parameter text-to-speech model, distributed on Hugging Face, for efficient speech synthesis.
Piper
Fast, local neural text-to-speech system with lightweight voices for offline speech synthesis.
StyleTTS 2
Official StyleTTS 2 implementation for human-level text-to-speech using style diffusion and adversarial training.
Tacotron 2
NVIDIA's Tacotron 2 implementation for neural text-to-speech synthesis with spectrogram prediction.
VITS
Official VITS implementation for end-to-end text-to-speech with variational inference and adversarial learning.
XTTS
Coqui's XTTS v2 multilingual voice generation model, which clones voices across languages from short audio prompts.

Toolkits and APIs

Deepgram
Speech-to-text, text-to-speech, and audio intelligence APIs for developers.
Kaldi
Open-source speech recognition toolkit for training and decoding ASR systems.
NVIDIA NeMo
NVIDIA's framework for building, training, and deploying conversational AI and generative AI models.
pyannote.audio
Open-source Python toolkit and pretrained pipelines for speaker diarization and speech processing tasks.
SpeechBrain
Open-source PyTorch toolkit for speech recognition, speaker recognition, separation, and enhancement.