Data & Labeling

32 tools for data & labeling.

Datasets & catalogs

  • AllenAI-hosted Colossal Clean Crawled Corpus (C4) dataset, derived from Common Crawl, for language model pretraining.

  • CIFAR-10 and CIFAR-100 are labeled tiny-image datasets for image classification benchmarking and model training.

  • Open repository of web crawl data, commonly used for large-scale language model pretraining.

  • Hugging Face Datasets is a library for accessing, sharing, and processing datasets for machine learning.

  • ImageNet is a large visual database organized by WordNet synsets for image classification and computer vision research.

  • LAION-5B is an open dataset of CLIP-filtered image-text pairs for training and evaluating vision-language models.

  • MNIST is a database of handwritten digit images commonly used for training and testing image classification models.

  • EleutherAI's 825 GiB diverse English text corpus, built for training large language models.

Data loading & formats

  • Fast data loading system for accelerating computer vision model training pipelines.

  • Uber library enabling single-machine or distributed training directly from Parquet datasets.

  • Fast DataFrame library and query engine for efficient data processing in Python and Rust.

  • PyTorch-friendly dataset format and loader using tar shards for large-scale deep learning data.

Labeling & annotation

  • Collaboration platform for AI engineers and domain experts to build high-quality datasets for LLM and NLP workflows.

  • Computer vision annotation platform for labeling images and videos for detection, segmentation, and tracking tasks.

  • Open-source text annotation tool for classification, sequence labeling, and sequence-to-sequence datasets.

  • Data labeling platform for annotating text, images, audio, video, time series, and multimodal data.

  • Scriptable annotation tool for creating training and evaluation data with active learning workflows.

  • Data engine for high-quality labeled data, model evaluation, and reinforcement learning from human feedback.

  • Open-source framework for programmatically building and managing training datasets with weak supervision.

  • Data-centric AI platform for programmatic labeling, data development, and model improvement.

  • Human data platform for RLHF, data labeling, evaluation, and enterprise AI training datasets.

Feature stores

  • Feast is an open-source feature store for managing and serving machine learning features.

  • Hopsworks provides an AI lakehouse and feature store platform for building and operating ML systems.

  • Tecton is an enterprise feature platform for building, serving, and monitoring production ML features.

Validation & quality

  • Great Expectations is an open-source framework for validating, documenting, and profiling data quality.

  • Guardrails AI validates and safeguards LLM inputs and outputs with programmable checks and validators.

  • Python library for extracting structured outputs from LLMs using Pydantic schemas and validation.

Synthetic data

  • Platform for generating privacy-preserving synthetic data and transforming sensitive datasets.

  • Platform for creating privacy-safe synthetic data for analytics, AI, and software testing.

Versioning & lineage

  • Delta Lake is an open-source storage framework that brings ACID transactions and reliability to data lakes.

  • Open-source version control system for machine learning projects, data, models, experiments, and pipelines.

  • lakeFS provides Git-like version control for data lakes with branching, commits, and reproducible data pipelines.