NLP, Retrieval & RAG | MLNotebooks Tools

Language models

ALBERT
Google's lite BERT variant, using cross-layer parameter sharing and factorized embeddings for efficient NLP pretraining.
BERT
Google bidirectional Transformer language model for pretraining contextual representations for NLP tasks.
DeBERTa
Microsoft disentangled-attention Transformer models for improved natural language understanding.
DistilBERT
Hugging Face distilled BERT family providing smaller, faster Transformer models for NLP.
ELECTRA
Google pretraining method using replaced-token detection for sample-efficient Transformer language representations.
GPT-2
OpenAI autoregressive Transformer language model for text generation and language modeling research.
Mamba
Selective state space model architecture for efficient long-sequence language modeling.
RoBERTa
Meta's optimized BERT pretraining recipe and models, implemented in fairseq, for robust NLP representations.
T5
Google text-to-text Transformer framework and models casting NLP tasks into a unified sequence generation format.
XLNet
Transformer-XL-based model that uses permutation language modeling for generalized autoregressive pretraining.

NLP libraries

AllenNLP
AllenAI research library for building and evaluating deep learning models for NLP.
Fairseq
Facebook AI sequence modeling toolkit for translation and text generation research.
Flair
NLP framework for state-of-the-art sequence labeling and embeddings.
Gensim
Python library for topic modeling, document similarity, and training unsupervised vector space representations at scale.
Hugging Face Tokenizers
Hugging Face Tokenizers provides fast, modern tokenizers used for NLP model preprocessing.
KenLM
KenLM is a toolkit for building, querying, and using statistical language models.
Marian NMT
Efficient neural machine translation framework written in C++ for research and production use.
NLTK
Python toolkit providing corpora, lexical resources, and classic NLP algorithms for language processing research and teaching.
OpenNMT
Open-source ecosystem for neural machine translation and sequence learning toolkits.
SentencePiece
SentencePiece is an unsupervised text tokenizer and detokenizer for neural text generation systems.
spaCy
Industrial-strength Python NLP library for tokenization, tagging, parsing, named entity recognition, and production pipelines.
Stanza
Stanford NLP Python library with neural pipelines for tokenization, POS tagging, parsing, NER, and sentiment analysis across many languages.
TextBlob
Python library offering a simple API for common NLP tasks such as tagging and sentiment analysis.
tiktoken
tiktoken is a fast BPE tokenizer for OpenAI model text tokenization.

Embeddings

BGE
BAAI FlagEmbedding models and tools for dense retrieval and embedding generation, including the BGE cross-encoder rerankers.
Cohere Embed
Cohere Embed models and API for semantic search, RAG, classification, and clustering.
E5
Microsoft text embedding models (EmbEddings from bidirEctional Encoder rEpresentations) for embedding and retrieval tasks.
ELMo
AllenNLP contextual word representation model using deep bidirectional language models.
fastText
Meta library for efficient text classification and word representations using subword information.
GloVe
Stanford unsupervised word embedding algorithm based on global word-word co-occurrence statistics.
GTE
Alibaba DAMO general text embedding model for semantic similarity and dense retrieval.
Jina Embeddings
Jina AI embedding models and API for multilingual, multimodal, and long-context retrieval use cases.
Nomic Embed
Nomic open text embedding model for long-context semantic search and retrieval.
OpenAI text-embedding-3
OpenAI embedding model family for converting text into vectors for search, clustering, and retrieval.
sentence-transformers
Python framework for sentence, text, and image embeddings using Transformer models.
text-embeddings-inference
High-performance Hugging Face inference server for text embeddings and reranking models.
txtai
All-in-one embeddings database for semantic search and language model workflows.
Voyage AI
Voyage AI embedding and reranking models for retrieval, search, and RAG applications.
word2vec
Google toolkit for learning efficient word vector representations from large text corpora.

Vector databases & indexes

Annoy
Spotify Annoy is a C++ and Python library for approximate nearest-neighbor search with memory-mapped indexes.
Chroma
Open-source AI application database for embeddings, vector search, and retrieval workflows.
Elasticsearch
Elasticsearch is Elastic's distributed search and analytics engine for full-text, vector, and hybrid retrieval workloads.
FAISS
Facebook AI Similarity Search is a library for efficient similarity search and clustering of dense vectors.
HNSWlib
HNSWlib is a lightweight C++ and Python library for approximate nearest-neighbor search using HNSW graphs.
LanceDB
Open-source vector database for AI applications built on the Lance columnar data format.
Marqo
Marqo is an AI-native search engine and API for multimodal vector search and retrieval.
pgvector
PostgreSQL extension that adds vector similarity search for embeddings inside Postgres.
Pinecone
Managed vector database for building search, recommendation, and RAG applications at scale.
Qdrant
Vector similarity search engine and database for high-performance neural search and RAG systems.
ScaNN
Google ScaNN performs efficient vector similarity search at scale for maximum inner product and nearest-neighbor queries.
Turbopuffer
Serverless vector database focused on low-cost, large-scale similarity search.
Vespa
Search and serving engine for vector search, recommendation, and large-scale inference.
Weaviate
Open-source vector database with hybrid search, generative search, and scalable AI-native data storage.

Retrieval engines & rerankers

Anserini
Anserini is a Lucene-based toolkit for reproducible information retrieval research.
Apache Lucene
Apache Lucene is a high-performance Java search library for indexing and ranked retrieval.
BM25 (Okapi)
Classic probabilistic bag-of-words ranking function used in search engines and information retrieval.
Cohere Rerank
Cohere Rerank is an API model for reordering search results and documents by semantic relevance to a query.
ColBERT
Stanford late-interaction neural retrieval model for efficient and effective passage search.
DPR
Meta Dense Passage Retrieval implementation for open-domain question answering.
OpenSearch
Open-source search and analytics suite with full-text search and vector search capabilities.
Pyserini
Pyserini is a Python toolkit for reproducible information retrieval research and retrieval pipelines.
SPLADE
NAVER sparse lexical and expansion model for neural information retrieval.
Tantivy
Tantivy is a Rust full-text search engine library inspired by Apache Lucene and used to build search systems.
Terrier
Terrier is an open-source search engine and information retrieval platform.
Whoosh
Whoosh is a pure Python library for indexing text and searching indexed content.
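
As a rough illustration of the BM25 (Okapi) entry above, the ranking function can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used by any engine listed here: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the sample documents are all illustrative choices, and the IDF uses the common variant with `+1` inside the logarithm so that scores stay non-negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are illustrative defaults (assumptions), not values from any engine.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, which keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "vector search with dense embeddings".split(),
]
scores = bm25_scores("cat mat".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Only the first document contains the exact query tokens, so it is the only one with a non-zero score. Real engines add analysis steps such as stemming and stop-word handling, which this sketch leaves out.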

Agents & RAG frameworks

AutoGen
Microsoft AutoGen is a framework for building and evaluating multi-agent AI applications.
CrewAI
CrewAI is a Python framework and platform for orchestrating role-based multi-agent automations.
DSPy
Stanford framework for programming and optimizing language-model pipelines with declarative modules.
Guidance
Guidance lets developers control language models with constrained generation and structured prompting.
Haystack
Deepset's open-source framework for building production-ready LLM applications and RAG pipelines.
LangChain
LangChain is a framework for building LLM applications with chains, agents, retrieval, and integrations.
Letta
Letta is a framework and platform for building stateful agents with memory and tool use.
LlamaIndex
LlamaIndex is a framework for connecting data to LLM apps with agents, workflows, and RAG pipelines.
Mastra
Mastra is a TypeScript agent framework for workflows, memory, evals, and integrations in AI apps.
NeMo Guardrails
NVIDIA framework for adding programmable guardrails and safety controls to conversational AI apps.
OpenAI Agents SDK
OpenAI Agents SDK is a Python toolkit for building agentic applications with tools, handoffs, and tracing.
Outlines
Library for structured text generation with LLMs using regexes and type constraints.
Semantic Kernel
Microsoft SDK for orchestrating AI agents and integrating LLMs with application workflows.

Evaluation & benchmarks

BEIR
Heterogeneous benchmark suite and codebase for zero-shot information retrieval evaluation.
DeepEval
Open-source LLM evaluation framework for testing RAG and language-model applications.
Giskard
Giskard provides testing and evaluation tools to detect risks in AI models and LLM applications.
GLUE
General Language Understanding Evaluation benchmark suite for natural language understanding systems.
MS MARCO
Microsoft Machine Reading Comprehension dataset and benchmark for passage ranking and QA tasks.
MTEB
Massive Text Embedding Benchmark for evaluating text embedding models across many tasks.
Promptfoo
Open-source tool for testing and red-teaming prompts and LLM applications.
RAGAS
Framework for evaluating retrieval-augmented generation and LLM applications, with metrics and test data generation.
SQuAD
Stanford Question Answering Dataset, a benchmark for reading comprehension and question answering systems.
SuperGLUE
More challenging language understanding benchmark suite building on GLUE for NLU evaluation.
SWE-bench
Benchmark for resolving real GitHub software issues using language models and coding agents.
trec_eval
trec_eval is the NIST tool for evaluating ad hoc retrieval runs using TREC measures.