Datasets & catalogs
AllenAI-hosted Colossal Clean Crawled Corpus (C4) dataset, derived from Common Crawl, for language model pretraining.
CIFAR-10 and CIFAR-100 are labeled tiny-image datasets for image classification benchmarking and model training.
Open repository of web crawl data, commonly used for large-scale language model pretraining.
Hugging Face Datasets is a library for accessing, sharing, and processing datasets for machine learning.
ImageNet is a large visual database organized by WordNet synsets for image classification and computer vision research.
LAION-5B is an open dataset of CLIP-filtered image-text pairs for training and evaluating vision-language models.
MNIST is a database of handwritten digit images commonly used for training and testing image classification models.
EleutherAI's diverse 825 GiB English text corpus, built for training large language models.
Data loading & formats
Fast data loading system for accelerating computer vision model training pipelines.
Uber library enabling single-machine or distributed training directly from Parquet datasets.
Fast DataFrame library and query engine for efficient data processing in Python and Rust.
PyTorch-friendly dataset format and loader using tar shards for large-scale deep learning data.
Labeling & annotation
Collaboration platform for AI engineers and domain experts to build high-quality datasets for LLM and NLP workflows.
Computer vision annotation platform for labeling images and videos for detection, segmentation, and tracking tasks.
Open-source text annotation tool for classification, sequence labeling, and sequence-to-sequence datasets.
Data labeling platform for annotating text, images, audio, video, time series, and multimodal data.
Scriptable annotation tool for creating training and evaluation data with active learning workflows.
Data engine for high-quality labeled data, model evaluation, and reinforcement learning from human feedback.
Open-source framework for programmatically building and managing training datasets with weak supervision.
Data-centric AI platform for programmatic labeling, data development, and model improvement.
Human data platform for RLHF, data labeling, evaluation, and enterprise AI training datasets.
Feature stores
Feast is an open-source feature store for managing and serving machine learning features.
Hopsworks provides an AI lakehouse and feature store platform for building and operating ML systems.
Tecton is an enterprise feature platform for building, serving, and monitoring production ML features.
Validation & quality
Great Expectations is an open-source framework for validating, documenting, and profiling data.
Guardrails AI validates and safeguards LLM inputs and outputs with programmable checks and validators.
Python library for extracting structured outputs from LLMs using Pydantic schemas and validation.
Synthetic data
Platform for generating privacy-preserving synthetic data and transforming sensitive datasets.
Platform for creating privacy-safe synthetic data for analytics, AI, and software testing.
Versioning & lineage
Delta Lake is an open-source storage framework that brings ACID transactions and reliability to data lakes.
Open-source version control system for machine learning projects, covering data, models, experiments, and pipelines.
lakeFS provides Git-like version control for data lakes, with branching, commits, and reproducible data pipelines.