Managed inference platforms
Model inference platform for deploying, scaling, and monitoring production machine learning and AI applications.
Inference platform for fast serving, fine-tuning, and deployment of generative AI models.
Cloud platform for running serverless Python applications, including AI inference, training jobs, and data workloads.
Platform for fine-tuning, serving, and optimizing open-source large language models.
Platform for running open-source ML models through APIs and deploying custom models.
Cloud GPU platform providing on-demand compute, serverless GPUs, and infrastructure for AI workloads.
Platform for running, fine-tuning, and deploying open-source and custom generative AI models.
Inference engines & runtimes
Apache TVM is an open-source machine learning compiler framework for CPUs, GPUs, and specialized accelerators.
DeepSpeed-MII is a DeepSpeed library for low-latency, high-throughput inference of deep learning models.
ONNX is an open format for representing ML models to enable interoperability across frameworks and runtimes.
Cross-platform inference and training accelerator for executing ONNX models across CPUs, GPUs, and specialized hardware.
Intel toolkit for optimizing and deploying AI inference across Intel CPUs, GPUs, NPUs, and other supported hardware.
Fast serving framework for large language and vision-language models, with an efficient runtime and frontend language support.
NVIDIA SDK for high-performance deep learning inference optimization and deployment on NVIDIA GPUs.
NVIDIA TensorRT-LLM is an open-source library for optimizing and serving large language model inference on NVIDIA GPUs.
Hugging Face production server for text generation inference, with optimized LLM serving and streaming APIs.
NVIDIA inference serving software for deploying AI models from multiple frameworks on GPUs and CPUs.
High-throughput, memory-efficient LLM inference and serving engine with PagedAttention and OpenAI-compatible APIs.
Serving frameworks
AI application framework for building, packaging, and serving ML models and LLM services.
Open-source tool for packaging ML models in containers with a predictable API for deployment.
Kubernetes custom resource platform for serving predictive and generative AI models at production scale.
Flexible, high-performance serving system for ML models, designed for TensorFlow production environments.
PyTorch model serving framework for deploying trained models with REST and gRPC inference endpoints.
Local inference
C and C++ inference framework for running large language models locally with GGUF quantization and broad hardware support.
Desktop application and local server for discovering, downloading, and running local LLMs with chat and developer APIs.
Local model runner and server for downloading, managing, and running large language models on personal machines.
Model gateways
LiteLLM provides an OpenAI-compatible proxy and SDK for routing across LLM providers, with budget controls and logging.
OpenRouter is a unified API and marketplace for routing requests across many AI model providers.
Portkey is an AI gateway for model routing, observability, caching, guardrails, and reliability controls.
Vercel AI Gateway provides a single endpoint for model providers, with routing, observability, caching, and usage controls.