Speech

19 tools for speech.

Speech recognition (ASR)

  • Convolution-augmented Transformer architecture for speech recognition that combines attention and convolution.

  • Mozilla open-source speech-to-text engine based on the Deep Speech architecture.

  • Meta HuBERT self-supervised speech representation learning model for automatic speech recognition.

  • Meta wav2vec 2.0 self-supervised speech representation model for ASR pretraining and fine-tuning.

  • OpenAI Whisper open-source automatic speech recognition and translation model.

Speech synthesis (TTS)

  • Suno text-prompted generative audio model for speech, music, background noise, and simple sound effects.

  • Deep learning toolkit for training and running text-to-speech, voice cloning, and speech synthesis models.

  • PyTorch implementation of FastSpeech 2 for fast, non-autoregressive neural text-to-speech synthesis.

  • Compact 82M-parameter text-to-speech model distributed on Hugging Face for efficient speech synthesis.

  • Fast, local neural text-to-speech system built around efficient voices for offline speech synthesis.

  • Official StyleTTS 2 implementation for human-level text-to-speech using style diffusion and adversarial training.

  • NVIDIA Tacotron 2 implementation for neural text-to-speech synthesis with spectrogram prediction.

  • Official VITS implementation for end-to-end text-to-speech with variational inference and adversarial learning.

  • Coqui XTTS v2 multilingual voice generation model for cloning voices across languages from short audio prompts.

Toolkits & services

  • Deepgram provides speech-to-text, text-to-speech, and audio intelligence APIs for developers.

  • Open-source speech recognition toolkit for training and decoding ASR systems.

  • NVIDIA NeMo framework for building, training, and deploying conversational AI and generative AI models.

  • Open-source Python toolkit and pretrained pipelines for speaker diarization and speech processing tasks.

  • Open-source PyTorch toolkit for speech recognition, speaker recognition, separation, and enhancement.