Production AI runs on DigitalOcean
One model for every request drives unnecessary spend.
Simple tasks and complex reasoning often get sent to the same frontier model, increasing inference costs as usage scales.
Routing logic becomes an engineering time-suck.
Supporting multiple models means building retries, failover logic, rate limit handling, observability, and orchestration internally.
Your frontier model provider goes down. You have no fallback.
Production inference workloads depend on reliable model availability. Without a routing layer, provider outages, rate limits, or degraded performance can lead to dropped requests and manual failover handling by engineering teams.
Pay per token. No GPU contracts. No minimums.
Forecasting your inference cost should look like forecasting your AWS bill. Batch at ~50% of real-time. Off-peak dynamic pricing on Mini Max M2.5 and Kimi K2.5 today, expanding.
$1M+ customer ARR up 179% YoY in Q1 2026.
>80% of AI customer ARR now from inference + core cloud, not bare metal.
Scale-to-zero on Serverless. Reserved capacity on Dedicated when you graduate.
PREDICTABLE AI ECONOMICS
If it can’t take real traffic, it doesn’t count.
Independently ranked, custom-kernel optimized, 55+ models behind one API. VPC, zero data retention, platform guardrails, and built-in observability ship as defaults — not enterprise add-ons.
#1 by Artificial Analysis on output speed for DeepSeek V3.2 and Qwen 3.5 397B.
230 tok/sec on DeepSeek V3.2 — 3.9× faster than AWS Bedrock.
180M+ patient interactions — Hippocratic AI clinical calls/day at 400ms in production.
PRODUCTION-GRADE BY DEFAULT
Bring your model. Keep your stack open.
Open-weight out of the box: DeepSeek, Qwen, Llama, Mixtral, Phi, gpt-oss. LoRA on Serverless lands Q2; full BYOM on Dedicated today. No proprietary lock-in.
Five integrated layers: compute, network, storage, data, AI — open at every one.
Messages API for Claude Code-compatible agentic workflows.
Drop-in OpenAI and Anthropic schemas. Migrate behind a feature flag, not a rewrite.
OPEN AT EVERY LAYER
Image, video, speech, vision-language. Same API, same bill.
Stable Diffusion 3.5 for image. Wan 2.2 for video. Qwen3 TTS for speech. Nemotron and Kimi for vision-language. Plus the lifecycle around them routing, evals, observability — that wrappers don’t have.
Among inference-only competitors, only Together ships full image/video/audio. Fireworks has no video. Baseten, Groq, DeepInfra have no multimodal.
Platform content guardrails on image and video by default — not opt-in.
Native multimodal generation, not a stitched chain of vendor APIs.
EVERY MODALITY, ONE PLATFORM
Optimize for cost, latency or both.
Inference Router analyzes incoming requests and routes them to the best-fit model based on your routing rules and optimization preferences.
Route requests based on task complexity
Send simpler workloads to lower-cost models
Reserve frontier models for advanced reasoning
01
Built-in failover across your model pool.
If a model is down, unavailable or rate-limited, the router automatically reroutes traffic to the next best model - no dropped calls.
Automatic failover across models
No manual fallback handling required
Reduce dropped requests during provider disruptions
02
Transparency into how requests are handled.
Monitor model selection, latency, token usage, and routing behavior across workloads with built-in traceability and observability.
Track which model handled each request
Monitor latency, usage, and spend
Analyze routing behavior in real time
03
NO CARD · FREE UNTIL YOU MAKE A CALL · CANCEL ANY TIME
No fees for using Inference Router during public preview
One model endpoint. No intelligent routing.
Providers like OpenAI, Anthropic, Together AI, and Fireworks expose a single-model API. Every request goes to the same model at the same price regardless of task complexity. Teams are responsible for building routing logic, failover handling, and optimization themselves.
Vs. Single model APIs
Rule-based manual routing with more operational overhead
WS Bedrock offers routing capabilities, but they require configuration and code-based setup. Routing is tied to the Bedrock model catalog and does not support natural language task definitions.
Vs. AWS Bedrock Routing
Compute-first infrastructure for custom inference stacks
GPU-focused providers offer raw compute resources for teams that want full control over their inference stack. This approach enables flexibility but requires assembling and maintaining the surrounding system components.
Vs. Martian & Open Router
Custom routing logic becomes an ongoing engineering project.
Without a managed router, teams build internal systems for model selection, fallback handling, retries, observability, and optimization. That infrastructure requires continuous maintenance as models, pricing, and traffic patterns evolve.
Vs. Building Your Own
“DigitalOcean's Inference Router gives us the kind of intelligent model selection we would otherwise have had to build ourselves. It routes each request to the right model based on complexity, helping us reduce inference costs by more than 40% while maintaining the accuracy, speed, and reliability our users expect.”
Hovsep Seraydarian
Co-Founder and CTO, LawVo
“Our AI Ethics Engine was built with open-source AI, so running it on closed-source models felt backwards. DigitalOcean's Inference Router closed the loop — we cut per-token cost by 61% while pulling p95 latency under 400ms. Same API. Zero code changes.”
Mario Gutiérrez
CTO at Unity Financial Network and Founder of Celiums.AI
Define your model pool.
Any mix of open-source, commercial, serverless, or dedicated. Two models or twenty.
01
Watch the Inference Router handle requests automatically
The Inference Router analyzes each incoming request, selects the best-fit model using live cost and latency data, and automatically reroutes traffic if a model becomes unavailable or rate-limited.
03
Decide how requests should be routed.
Tell the router how different requests should be handled based on task complexity, cost, or latency preferences.
02
Is there a cost to use an Inference Router?
Inference Router is free to use during public preview. You are only billed for the model calls the router makes, at standard Serverless or Dedicated Inference rates. There is no additional charge for the routing layer itself.
Q · 01
How do I set up my first router?
You can create a router through the DigitalOcean Control Panel or API. Define your model pool, describe your routing rules in plain English, set your optimization preference (cost or latency), and update one line in your existing API call "model": "router:your-router-name". No code changes to your application are required.
Q · 02
What models can I add to my router pool?
Any combination of open-source and commercial models available on DigitalOcean's Serverless and Dedicated Inference tiers. You can mix frontier models with open-source alternatives in the same pool- two models or twenty.
Q · 03
What happens when a model in my pool goes down or gets rate-limited?
Inference Router automatically reroutes the request to the next best available model in your pool, no dropped calls, no manual intervention, no code changes required. Failover happens instantly without any action on your part.
Q · 04
How does the router decide which model to use for each request?
DigitalOcean's purpose-built 30B Mixture of Experts (MoE) router model reads each incoming request, resolves the intent in approximately 200ms, and selects the best-fit model based on your routing rules and optimization preference- lowest cost, lowest latency, or a custom model order-using live cost and latency data.
Q · 05
© DigitalOcean, LLC.
The problem
Our solution
Pricing
HOW THE MARKET BREAKS DOWN
In production
You define the rules. We handle the routing.
Have questions?
Production AI runs on DigitalOcean
Inference Router · Public Preview
How do I know what the router decided for each request?
Every routing decision is fully logged - model selected, task detected, latency, token usage, and cost. You can monitor routing behavior, analyze spend, and track latency across your model pool in real time through built-in observability.
Q · 06