NLP, Retrieval & RAG

90 tools for NLP, retrieval & RAG.

Language models

  • Google lite BERT variant with parameter sharing and factorized embeddings for efficient NLP pretraining.

  • Google bidirectional Transformer language model for pretraining contextual representations for NLP tasks.

  • Microsoft disentangled-attention Transformer models for improved natural language understanding.

  • Hugging Face distilled BERT family providing smaller, faster Transformer models for NLP.

  • Google pretraining method using replaced-token detection for sample-efficient Transformer language representations.

  • OpenAI autoregressive Transformer language model for text generation and language modeling research.

  • Selective state space model architecture for efficient long-sequence language modeling.

  • Meta optimized BERT pretraining recipe and models implemented in fairseq for robust NLP representations.

  • Google text-to-text Transformer framework and models casting NLP tasks into a unified sequence generation format.

  • Transformer-XL-based permutation language model for generalized autoregressive pretraining.

NLP libraries

  • AllenAI research library for building and evaluating deep learning models for NLP.

  • Facebook AI sequence modeling toolkit for translation and text generation research.

  • NLP framework for state-of-the-art sequence labeling and embeddings.

  • Python library for topic modeling, document similarity, and training unsupervised vector space representations at scale.

  • Hugging Face Tokenizers provides fast, modern tokenizers used for NLP model preprocessing.

  • KenLM is a toolkit for building, querying, and using statistical language models.

  • Efficient neural machine translation framework written in C++ for research and production use.

  • Python toolkit providing corpora, lexical resources, and classic NLP algorithms for language processing research and teaching.

  • Open-source ecosystem for neural machine translation and sequence learning toolkits.

  • SentencePiece is an unsupervised text tokenizer and detokenizer for neural text generation systems.

  • Industrial-strength Python NLP library for tokenization, tagging, parsing, named entity recognition, and production pipelines.

  • Stanford NLP Python library with neural pipelines for tokenization, POS tagging, parsing, NER, and sentiment across many languages.

  • Python library offering a simple API for common NLP tasks such as tagging and sentiment analysis.

  • tiktoken is a fast BPE tokenizer for OpenAI model text tokenization.

Embeddings

  • BAAI FlagEmbedding models and tools for dense retrieval and embedding generation, including the BGE cross-encoder rerankers.

  • Cohere Embed models and API for semantic search, RAG, classification, and clustering.

  • Microsoft EmbEddings from bidirEctional Encoder rEpresentations for text embedding and retrieval tasks.

  • AllenNLP contextual word representation model using deep bidirectional language models.

  • Meta library for efficient text classification and word representations using subword information.

  • Stanford unsupervised word embedding algorithm based on global word-word co-occurrence statistics.

  • Alibaba DAMO general text embedding model for semantic similarity and dense retrieval.

  • Jina AI embedding models and API for multilingual, multimodal, and long-context retrieval use cases.

  • Nomic open text embedding model for long-context semantic search and retrieval.

  • OpenAI embedding model family for converting text into vectors for search, clustering, and retrieval.

  • Python framework for sentence, text, and image embeddings using Transformer models.

  • High-performance Hugging Face inference server for text embeddings and reranking models.

  • All-in-one embeddings database for semantic search and language model workflows.

  • Voyage AI embedding and reranking models for retrieval, search, and RAG applications.

  • Google toolkit for learning efficient word vector representations from large text corpora.

Vector databases & indexes

  • Spotify Annoy is a C++ and Python library for approximate nearest-neighbor search with memory-mapped indexes.

  • Open-source AI application database for embeddings, vector search, and retrieval workflows.

  • Elasticsearch is Elastic's distributed search and analytics engine for full-text, vector, and hybrid retrieval workloads.

  • Facebook AI Similarity Search is a library for efficient similarity search and clustering of dense vectors.

  • HNSWlib is a lightweight C++ and Python library for approximate nearest-neighbor search using HNSW graphs.

  • Open-source vector database for AI applications built on the Lance columnar data format.

  • Marqo is an AI-native search engine and API for multimodal vector search and retrieval.

  • PostgreSQL extension that adds vector similarity search for embeddings inside Postgres.

  • Managed vector database for building search, recommendation, and RAG applications at scale.

  • Vector similarity search engine and database for high-performance neural search and RAG systems.

  • Google ScaNN performs efficient vector similarity search at scale for maximum inner product and nearest-neighbor queries.

  • Serverless vector database focused on low-cost, large-scale similarity search.

  • Search and serving engine for vector search, recommendation, and large-scale inference.

  • Open-source vector database with hybrid search, generative search, and scalable AI-native data storage.

Retrieval engines & rerankers

  • Anserini is a Lucene-based toolkit for reproducible information retrieval research.

  • Apache Lucene is a high-performance Java search library for indexing and ranked retrieval.

  • Classic probabilistic bag-of-words ranking function used in search engines and information retrieval.

  • Cohere Rerank is an API model for reordering search results and documents by semantic relevance to a query.

  • Stanford late-interaction neural retrieval model for efficient and effective passage search.

  • Meta Dense Passage Retrieval implementation for open-domain question answering.

  • Open-source search and analytics suite with full-text search and vector search capabilities.

  • Pyserini is a Python toolkit for reproducible information retrieval research and retrieval pipelines.

  • NAVER sparse lexical and expansion model for neural information retrieval.

  • Tantivy is a Rust full-text search engine library inspired by Apache Lucene and used to build search systems.

  • Terrier is an open-source search engine and information retrieval platform.

  • Whoosh is a pure Python library for indexing text and searching indexed content.

Agents & RAG frameworks

  • Microsoft AutoGen is a framework for building and evaluating multi-agent AI applications.

  • CrewAI is a Python framework and platform for orchestrating role-based multi-agent automations.

  • Stanford framework for programming and optimizing language-model pipelines with declarative modules.

  • Guidance lets developers control language models with constrained generation and structured prompting.

  • Deepset's open-source framework for building production-ready LLM applications and RAG pipelines.

  • LangChain is a framework for building LLM applications with chains, agents, retrieval, and integrations.

  • Letta is a framework and platform for building stateful agents with memory and tool use.

  • LlamaIndex is a framework for connecting data to LLM apps with agents, workflows, and RAG pipelines.

  • Mastra is a TypeScript agent framework for workflows, memory, evals, and integrations in AI apps.

  • NVIDIA framework for adding programmable guardrails and safety controls to conversational AI apps.

  • OpenAI Agents SDK is a Python toolkit for building agentic applications with tools, handoffs, and tracing.

  • Library for structured text generation with LLMs using regexes and type constraints.

  • Microsoft SDK for orchestrating AI agents and integrating LLMs with application workflows.

Evaluation & benchmarks

  • Heterogeneous benchmark suite and codebase for zero-shot information retrieval evaluation.

  • Open-source LLM evaluation framework for testing RAG and language-model applications.

  • Giskard provides testing and evaluation tools to detect risks in AI models and LLM applications.

  • General Language Understanding Evaluation benchmark suite for natural language understanding systems.

  • Microsoft Machine Reading Comprehension dataset and benchmark for passage ranking and QA tasks.

  • Massive Text Embedding Benchmark for evaluating text embedding models across many tasks.

  • Open-source tool for testing and red-teaming prompts and LLM applications.

  • Framework for evaluating retrieval-augmented generation and LLM applications with metrics and test data generation.

  • Stanford Question Answering Dataset, a benchmark for reading comprehension question answering systems.

  • More challenging language understanding benchmark suite building on GLUE for NLU evaluation.

  • Benchmark for resolving real GitHub software issues using language models and coding agents.

  • trec_eval is the NIST tool for evaluating ad hoc retrieval runs using TREC measures.