Model Serving

30 tools for model serving.

Managed inference platforms

  • Model inference platform for deploying, scaling, and monitoring production machine learning and AI applications.

  • Inference platform for fast serving, fine-tuning, and deployment of generative AI models.

  • Cloud platform for running serverless Python applications, including AI inference, training jobs, and data workloads.

  • Platform for fine-tuning, serving, and optimizing open-source large language models.

  • Platform for running open-source ML models through APIs and deploying custom models.

  • Cloud GPU platform providing on-demand compute, serverless GPUs, and infrastructure for AI workloads.

  • Platform for running, fine-tuning, and deploying open-source and custom generative AI models.

Inference engines & runtimes

  • Apache TVM is an open-source machine learning compiler framework for CPUs, GPUs, and specialized accelerators.

  • DeepSpeed-MII is a DeepSpeed library for low-latency, high-throughput inference of deep learning models.

  • ONNX is an open format for representing ML models to enable interoperability across frameworks and runtimes.

  • Cross-platform inference and training accelerator for executing ONNX models across CPUs, GPUs, and specialized hardware.

  • Intel toolkit for optimizing and deploying AI inference across Intel CPUs, GPUs, NPUs, and other supported hardware.

  • Fast serving framework for large language and vision-language models, with an efficient runtime and frontend language support.

  • NVIDIA SDK for high-performance deep learning inference optimization and deployment on NVIDIA GPUs.

  • NVIDIA TensorRT-LLM is an open-source library for optimizing and serving large language model inference on NVIDIA GPUs.

  • Hugging Face production server for text generation inference, with optimized LLM serving and streaming APIs.

  • NVIDIA inference serving software for deploying AI models from multiple frameworks on GPUs and CPUs.

  • High-throughput, memory-efficient LLM inference and serving engine with PagedAttention and OpenAI-compatible APIs.

Serving frameworks

  • AI application framework for building, packaging, and serving ML models and LLM services.

  • Open-source tool for packaging ML models in containers with a predictable API for deployment.

  • Kubernetes custom resource platform for serving predictive and generative AI models at production scale.

  • Flexible, high-performance serving system for ML models, designed for TensorFlow production environments.

  • PyTorch model serving framework for deploying trained models with REST and gRPC inference endpoints.

Local inference

  • C and C++ inference framework for running large language models locally with GGUF quantization and broad hardware support.

  • Desktop application and local server for discovering, downloading, and running local LLMs, with chat and developer APIs.

  • Local model runner and server for downloading, managing, and running large language models on personal machines.

Model gateways

  • LiteLLM provides an OpenAI-compatible proxy and SDK for routing across LLM providers, with budgets and logs.

  • OpenRouter is a unified API and marketplace for routing requests across many AI model providers.

  • Portkey is an AI gateway for model routing, observability, caching, guardrails, and reliability controls.

  • Vercel AI Gateway gives one endpoint for model providers, with routing, observability, caching, and usage controls.