Backbones & classification
AlexNet is a deep convolutional neural network that popularized GPU-trained CNNs for large-scale image classification.
ConvNeXt modernizes convolutional networks with design choices inspired by vision transformers for image recognition tasks.
DenseNet connects each layer to every subsequent layer within a dense block of a feed-forward CNN to improve feature reuse and gradient flow.
Meta's self-supervised vision foundation model for learning general-purpose image features without labels.
EfficientNet scales CNN depth, width, and resolution jointly with a compound coefficient for efficient image recognition models.
GoogLeNet/Inception uses inception modules to build efficient deep convolutional networks for image classification and detection.
MobileNet is a family of efficient CNN architectures using depthwise separable convolutions for mobile vision tasks.
ResNet introduced deep residual learning with skip connections for training very deep image recognition networks.
Collection of PyTorch image models with pretrained vision architectures and training utilities.
VGG is a family of very deep convolutional networks from Oxford's Visual Geometry Group, known for simple stacks of 3x3 convolutions for image recognition.
Vision Transformer applies Transformer encoders to image patches for image classification and vision representation learning.
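The MobileNet entry above credits its efficiency to depthwise separable convolutions. As a rough illustration of why, the sketch below compares weight counts for a standard convolution and a depthwise separable one. The function names and the layer sizes are hypothetical, chosen only for the example, and biases are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer sizes (assumption), not taken from any specific model.
    k, c_in, c_out = 3, 64, 128
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std} weights, depthwise separable: {sep} weights")
```

For a 3x3 layer the separable form needs roughly an order of magnitude fewer weights once the output channel count is large, which is the trade-off these mobile-oriented backbones exploit.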
Detection & segmentation
TensorFlow implementation of DeepLab for semantic image segmentation using atrous (dilated) convolutions.
Facebook AI Research library for object detection and segmentation, with reference implementations of Mask R-CNN, RetinaNet, and other architectures.
Meta's Detection Transformer (DETR) model for end-to-end object detection with transformers and bipartite matching.
Reference Faster R-CNN codebase for region-proposal-based object detection research.
OpenMMLab PyTorch toolbox for object detection, instance segmentation, and related vision research.
Meta's Segment Anything Model (SAM) for promptable image segmentation and mask generation.
Meta's Segment Anything Model 2 (SAM 2) for promptable object segmentation in images and videos.
Convolutional neural network architecture for biomedical image segmentation, developed at the University of Freiburg.
Ultralytics implementation of YOLO models for object detection, segmentation, pose estimation, and tracking workflows.
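Bipartite matching, mentioned in the Detection Transformer entry above, pairs each prediction with exactly one ground-truth object so that the total matching cost is minimal. The sketch below is a minimal brute-force version for tiny inputs, not the solver any of the listed projects use; the function name and the cost values are hypothetical. Practical code usually relies on a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, assignment) minimizing sum of cost[i][assignment[i]].

    cost is a square matrix where cost[i][j] is the cost of matching
    prediction i to ground-truth object j. All permutations are tried,
    so this is only suitable for very small matrices.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


if __name__ == "__main__":
    # Made-up matching costs for three predictions and three objects.
    cost = [
        [0.9, 0.1, 0.5],
        [0.2, 0.8, 0.7],
        [0.6, 0.4, 0.3],
    ]
    total, assignment = best_assignment(cost)
    print(assignment, round(total, 3))
```

Because each prediction is matched to a distinct object, duplicate detections of the same object cannot all be counted as correct, which is what lets a set-based detector skip non-maximum suppression.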
Vision-language & multimodal
Salesforce's BLIP-2 vision-language pretraining project, which bootstraps from frozen image encoders and large language models.
OpenAI's contrastive vision-language model for connecting images and text in a shared embedding space.
Microsoft's Florence-2 vision foundation model for captioning, object detection, grounding, and segmentation tasks.
Large Language and Vision Assistant project for multimodal chat and visual instruction tuning.
Vision-language model trained with a pairwise sigmoid loss for image-text representation learning, as implemented in Hugging Face Transformers.
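The sigmoid-loss entry above treats every image-text pair as an independent binary classification: matching pairs are positives, all other pairs are negatives. The following is a minimal pure-Python sketch of that idea, not the reference implementation; the function names, the toy embeddings, and the `temperature` and `bias` defaults are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Binary log-loss summed over all image-text pairs, divided by batch size.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    temperature and bias are illustrative values (assumptions).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(imgs[i], txts[j]) + bias
            # -log(sigmoid(z)) written as log(1 + exp(-z))
            total += math.log1p(math.exp(-label * logit))
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]    # text i matches image i
    shuffled = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the other image
    print(pairwise_sigmoid_loss(images, aligned))
    print(pairwise_sigmoid_loss(images, shuffled))
```

Unlike a softmax-based contrastive loss, each pair contributes its own term, so no normalization across the whole batch is needed.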
OCR & document AI
Mindee's document text recognition library for OCR, with detection, recognition, and document parsing pipelines.
Python OCR library for text detection and recognition across many languages using deep learning models.
PaddlePaddle OCR toolkit for multilingual text detection, recognition, and document parsing.
Open-source OCR engine for recognizing printed text from images and scanned documents.
Transformer-based OCR model in Hugging Face Transformers for printed and handwritten text recognition.
Image processing & augmentation
Fast image augmentation library used to improve computer vision model training data pipelines.
Differentiable computer vision library for PyTorch with image processing, geometry, augmentation, and vision AI components.
Open-source computer vision and machine learning software library for image and video applications.
Python Imaging Library fork for opening, manipulating, and saving many image file formats.
Python library of image processing algorithms built for scientific and computer vision workflows.
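As a small illustration of what the augmentation entries above do, the sketch below applies a horizontal flip to an image and to a bounding box so the label stays aligned with the pixels. It uses plain nested lists rather than any listed library's API; the function names and the `(x_min, y_min, x_max, y_max)` pixel-edge box convention are assumptions for the example.

```python
def hflip_image(image):
    """Flip an image stored as a list of rows (each row a list of pixels)."""
    return [row[::-1] for row in image]


def hflip_box(box, width):
    """Flip an (x_min, y_min, x_max, y_max) box in an image of the given width.

    Coordinates are pixel edges, so x ranges from 0 to width.
    """
    x_min, y_min, x_max, y_max = box
    return (width - x_max, y_min, width - x_min, y_max)


if __name__ == "__main__":
    image = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
    ]
    box = (0, 0, 1, 2)  # covers the leftmost column
    print(hflip_image(image))
    print(hflip_box(box, width=4))
```

Applying the same flip twice returns the original image and box, which makes a convenient sanity check when writing augmentation code.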
Datasets & platforms
COCO is a large-scale dataset for object detection, segmentation, keypoint detection, and vision evaluation.
Open Images is a large dataset of images with annotations for classification, detection, and segmentation.
Pascal VOC provides benchmark datasets and challenges for visual object classification, detection, and segmentation.
Computer vision platform for dataset management, labeling, model training, deployment, and application building.