Computer Vision | MLNotebooks Tools

Backbones & classification

AlexNet
AlexNet is a deep convolutional neural network that popularized GPU-trained CNNs for large-scale image classification.
ConvNeXt
ConvNeXt modernizes convolutional networks with design choices inspired by vision transformers for image recognition tasks.
DenseNet
DenseNet connects each layer to every subsequent layer within a dense block of a feed-forward CNN to improve feature reuse and gradient flow.
DINOv2
Meta's self-supervised vision foundation model for learning general-purpose image features without labels.
EfficientNet
EfficientNet scales CNN depth, width, and input resolution together using a compound coefficient for efficient image recognition models.
Inception/GoogLeNet
GoogLeNet/Inception uses inception modules to build efficient deep convolutional networks for image classification and detection.
MobileNet
MobileNet is a family of efficient CNN architectures using depthwise separable convolutions for mobile vision tasks.
ResNet
ResNet introduced deep residual learning with skip connections for training very deep image recognition networks.
timm
Collection of PyTorch image models with pretrained vision architectures and training utilities.
VGG
VGG is a family of very deep convolutional networks from Oxford's Visual Geometry Group, known for simple stacks of 3x3 convolutions for image recognition.
Vision Transformer (ViT)
Vision Transformer applies Transformer encoders to image patches for image classification and vision representation learning.
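
As a rough illustration of why the depthwise separable convolutions mentioned under MobileNet are considered efficient, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The function names and the layer shape (3x3 kernel, 128 input channels, 256 output channels) are assumptions chosen for illustration, not values from any particular MobileNet configuration, and bias terms are ignored.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {standard}, separable: {separable}, ratio: {standard / separable:.2f}")
```

For this assumed shape the separable layer needs roughly one ninth of the weights, which is the kind of saving that makes the design attractive on mobile hardware.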

DeepLab
TensorFlow implementation of DeepLab for semantic image segmentation with atrous (dilated) convolutions.
Detectron2
Facebook AI Research library for object detection and segmentation, with reference implementations of Mask R-CNN, RetinaNet, and other architectures.
DETR
Meta's Detection Transformer model for end-to-end object detection with transformers and bipartite matching.
Faster R-CNN
Reference Faster R-CNN codebase for research on region-proposal-based object detection.
MMDetection
OpenMMLab PyTorch toolbox for object detection, instance segmentation, and related vision research.
SAM (Segment Anything)
Meta's Segment Anything Model for promptable image segmentation and mask generation.
SAM 2
Meta's Segment Anything Model 2 for promptable object segmentation in images and videos.
U-Net
Convolutional neural network architecture for biomedical image segmentation, originally developed at the University of Freiburg.
YOLO
Ultralytics implementation of YOLO models for object detection, segmentation, pose estimation, and tracking workflows.
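
The bipartite matching mentioned in the DETR entry pairs each prediction with exactly one ground-truth object so that the total matching cost is minimal. The sketch below illustrates the idea with a brute-force search over permutations, which is only practical for tiny inputs; production implementations typically use the Hungarian algorithm instead. The function name and the toy cost matrix are assumptions made for illustration.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Brute-force one-to-one matching for a square cost matrix.

    cost[i][j] is the cost of pairing prediction i with ground-truth object j.
    Returns (assignment, total), where assignment[i] is the ground-truth index
    matched to prediction i.
    """
    n = len(cost)
    best_assignment, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_assignment, best_total = perm, total
    return list(best_assignment), best_total


# Hypothetical costs: rows are predictions, columns are ground-truth objects.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.6, 0.3],
]
assignment, total = min_cost_matching(cost)
print(assignment, round(total, 3))
```

Because every prediction is matched to a distinct target, duplicate detections of the same object are penalized during training rather than filtered afterwards.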

BLIP-2
Salesforce's BLIP-2 vision-language pretraining project, which bootstraps from frozen image encoders and large language models.
CLIP
OpenAI's contrastive vision-language model that maps images and text into a shared embedding space.
Florence-2
Microsoft's Florence-2 vision foundation model for captioning, object detection, grounding, and segmentation tasks.
LLaVA
Large Language and Vision Assistant project for multimodal chat and visual instruction tuning.
SigLIP
Vision-language model that uses a sigmoid loss for image-text representation learning, available in Hugging Face Transformers.
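
To make the "shared embedding space" idea from the CLIP entry concrete, the sketch below picks the caption whose embedding is closest to an image embedding by cosine similarity. The three-dimensional vectors are made-up toy values, and the function names are hypothetical; real models produce embeddings with hundreds of dimensions from their own image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(image_embedding, caption_embeddings[caption]),
    )


# Toy embeddings, invented for illustration only.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a diagram": [0.0, 0.1, 0.9],
}
print(best_caption(image_embedding, caption_embeddings))
```

The same comparison underlies zero-shot classification: class names are written as captions, embedded as text, and the closest one is taken as the prediction.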

docTR
Mindee's document text recognition library for OCR, with detection, recognition, and document parsing pipelines.
EasyOCR
Python OCR library for text detection and recognition across many languages using deep learning models.
PaddleOCR
PaddlePaddle OCR toolkit for multilingual text detection, recognition, and document parsing.
Tesseract
Open-source OCR engine for recognizing printed text from images and scanned documents.
TrOCR
Transformer-based OCR model in Hugging Face Transformers for printed and handwritten text recognition.

Albumentations
Fast image augmentation library used to improve computer vision model training data pipelines.
Kornia
Differentiable computer vision library for PyTorch with image processing, geometry, augmentation, and vision AI components.
OpenCV
Open-source computer vision and machine learning software library for image and video applications.
Pillow
Fork of the Python Imaging Library (PIL) for opening, manipulating, and saving many image file formats.
scikit-image
Python library of image processing algorithms built for scientific and computer vision workflows.

COCO
COCO is a large-scale dataset for object detection, segmentation, keypoint detection, and vision evaluation.
Open Images
Open Images is a large dataset of images with annotations for classification, detection, and segmentation.
Pascal VOC
Pascal VOC provides benchmark datasets and challenges for visual object classification, detection, and segmentation.
Roboflow
Computer vision platform for dataset management, labeling, model training, deployment, and application building.