Data & Labeling | MLNotebooks Tools

Datasets & catalogs

C4
AllenAI-hosted Colossal Clean Crawled Corpus (C4), a dataset derived from Common Crawl for language model pretraining.
CIFAR-10/100
CIFAR-10 and CIFAR-100 are labeled tiny-image datasets for image classification benchmarking and model training.
Common Crawl
Open repository of web crawl data commonly used for large-scale language model pretraining.
Hugging Face Datasets
Hugging Face Datasets is a library for accessing, sharing, and processing datasets for machine learning.
ImageNet
ImageNet is a large visual database organized by WordNet synsets for image classification and computer vision research.
LAION-5B
LAION-5B is an open dataset of CLIP-filtered image-text pairs for training and evaluating vision-language models.
MNIST
MNIST is a database of handwritten digit images commonly used for training and testing image classification models.
The Pile
EleutherAI's 825 GiB diverse English text corpus, built for training large language models.

FFCV
Fast data loading system for accelerating computer vision model training pipelines.
Petastorm
Library from Uber that enables single-machine or distributed training directly from Parquet datasets.
Polars
Fast DataFrame library and query engine for efficient data processing in Python and Rust.
WebDataset
PyTorch-friendly dataset format and loader using tar shards for large-scale deep learning data.

Argilla
Collaboration platform for AI engineers and domain experts to build high-quality datasets for LLM and NLP workflows.
CVAT
Computer vision annotation platform for labeling images and videos for detection, segmentation, and tracking tasks.
doccano
Open-source text annotation tool for classification, sequence labeling, and sequence-to-sequence datasets.
Label Studio
Data labeling platform for annotating text, images, audio, video, time series, and multimodal data.
Prodigy
Scriptable annotation tool for creating training and evaluation data with active learning workflows.
Scale AI
Data engine for high-quality labeled data, model evaluation, and reinforcement learning from human feedback.
Snorkel
Open-source framework for programmatically building and managing training datasets with weak supervision.
Snorkel AI
Data-centric AI platform for programmatic labeling, data development, and model improvement.
Surge AI
Human data platform for RLHF, data labeling, evaluation, and enterprise AI training datasets.

Feast
Feast is an open-source feature store for managing and serving machine learning features.
Hopsworks
Hopsworks provides an AI lakehouse and feature store platform for building and operating ML systems.
Tecton
Tecton is an enterprise feature platform for building, serving, and monitoring production ML features.

Great Expectations
Great Expectations is an open-source framework for validating, documenting, and profiling data quality.
Guardrails AI
Guardrails AI validates and safeguards LLM inputs and outputs with programmable checks and validators.
Instructor
Python library for extracting structured outputs from LLMs using Pydantic schemas and validation.

Gretel
Platform for generating privacy-preserving synthetic data and transforming sensitive datasets.
Mostly AI
Platform for creating privacy-safe synthetic data for analytics, AI, and software testing.

Delta Lake
Delta Lake is an open-source storage framework that brings ACID transactions and reliability to data lakes.
DVC
Open-source version control system for machine learning projects, covering data, models, experiments, and pipelines.
lakeFS
lakeFS provides Git-like version control for data lakes, with branching, commits, and reproducible data pipelines.