Model Serving | MLNotebooks Tools

Managed inference platforms

Baseten
Model inference platform for deploying, scaling, and monitoring production machine learning and AI applications.
Fireworks AI
Inference platform for fast serving, fine-tuning, and deployment of generative AI models.
Modal
Cloud platform for running serverless Python applications, including AI inference, training jobs, and data workloads.
Predibase
Platform for fine-tuning, serving, and optimizing open-source large language models.
Replicate
Platform for running open-source ML models through APIs and deploying custom models.
RunPod
Cloud GPU platform providing on-demand compute, serverless GPUs, and infrastructure for AI workloads.
Together AI
Platform for running, fine-tuning, and deploying open-source and custom generative AI models.

Apache TVM
Apache TVM is an open-source machine learning compiler framework for CPUs, GPUs, and specialized accelerators.
DeepSpeed-MII
DeepSpeed-MII is a DeepSpeed library for low-latency, high-throughput inference of deep learning models.
ONNX
ONNX is an open format for representing ML models to enable interoperability across frameworks and runtimes.
ONNX Runtime
Cross-platform inference and training accelerator for executing ONNX models across CPUs, GPUs, and specialized hardware.
OpenVINO
Intel toolkit for optimizing and deploying AI inference across Intel CPUs, GPUs, NPUs, and other supported hardware.
SGLang
Fast serving framework for large language models and vision-language models, with an efficient runtime and frontend language support.
TensorRT
NVIDIA SDK for high-performance deep learning inference optimization and deployment on NVIDIA GPUs.
TensorRT-LLM
NVIDIA TensorRT-LLM is an open-source library for optimizing and serving large language model inference on NVIDIA GPUs.
TGI
Hugging Face production server for text generation inference, with optimized LLM serving and streaming APIs.
Triton Inference Server
NVIDIA inference serving software for deploying AI models from multiple frameworks on GPUs and CPUs.
vLLM
High-throughput, memory-efficient LLM inference and serving engine with PagedAttention and OpenAI-compatible APIs.

BentoML
AI application framework for building, packaging, and serving ML models and LLM services.
Cog
Open-source tool for packaging ML models in containers with a predictable API for deployment.
KServe
Kubernetes platform, built on custom resources, for serving predictive and generative AI models at production scale.
TensorFlow Serving
Flexible, high-performance serving system for ML models, designed for TensorFlow production environments.
TorchServe
PyTorch model serving framework for deploying trained models with REST and gRPC inference endpoints.

llama.cpp
C and C++ inference framework for running large language models locally with GGUF quantization and broad hardware support.
LM Studio
Desktop application and local server for discovering, downloading, and running local LLMs, with chat and developer APIs.
Ollama
Local model runner and server for downloading, managing, and running large language models on personal machines.

LiteLLM
LiteLLM provides an OpenAI-compatible proxy and SDK for routing across LLM providers, with budgets and logs.
OpenRouter
OpenRouter is a unified API and marketplace for routing requests across many AI model providers.
Portkey
Portkey is an AI gateway for model routing, observability, caching, guardrails, and reliability controls.
Vercel AI Gateway
Vercel AI Gateway provides a single endpoint for model providers, with routing, observability, caching, and usage controls.