55+ curated models
OpenAI- and Anthropic-compatible
VPC + zero data retention by default
Production AI runs on DigitalOcean
You pay for GPUs, not requests.
Inference rarely runs at steady state. Traffic spikes, then disappears, but capacity stays provisioned. Even with autoscaling, you’re still paying for idle infrastructure built for peak demand.
Shipping one model call turns into an entire platform.
What starts as a simple API request quickly expands into routing, retries, caching, observability, rate limits, and cost controls. By production, you’ve built an internal inference stack.
Model APIs look interchangeable until you try to switch them.
Every provider has different schemas, latency profiles, safety layers, and pricing behavior. Switching isn’t swapping endpoints — it’s reworking routing, evaluation, and application logic.
Pay per token. No GPU contracts. No minimums.
Forecasting your inference cost should look like forecasting your AWS bill. Batch runs at ~50% of real-time pricing. Off-peak dynamic pricing on MiniMax M2.5 and Kimi K2.5 today, expanding across the catalog.
$1M+ customer ARR up 179% YoY in Q1 2026.
>80% of AI customer ARR now from inference + core cloud, not bare metal.
Scale-to-zero on Serverless. Reserved capacity on Dedicated when you graduate.
PREDICTABLE AI ECONOMICS
If it can’t take real traffic, it doesn’t count.
Independently ranked, custom-kernel optimized, 55+ models behind one API. VPC, zero data retention, platform guardrails, and built-in observability ship as defaults — not enterprise add-ons.
#1 by Artificial Analysis on output speed for DeepSeek V3.2 and Qwen 3.5 397B.
230 tok/sec on DeepSeek V3.2 — 3.9× faster than AWS Bedrock.
180M+ patient interactions — Hippocratic AI runs clinical calls in production every day at 400ms.
PRODUCTION-GRADE BY DEFAULT
Bring your model. Keep your stack open.
Open-weight out of the box: DeepSeek, Qwen, Llama, Mixtral, Phi, gpt-oss. LoRA on Serverless lands Q2; full BYOM on Dedicated today. No proprietary lock-in.
Five integrated layers: compute, network, storage, data, AI — open at every one.
Messages API for Claude Code-compatible agentic workflows.
Drop-in OpenAI and Anthropic schemas. Migrate behind a feature flag, not a rewrite.
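In practice, "behind a feature flag" can be as small as an environment-driven config lookup. A minimal sketch; the endpoints and model IDs below are placeholders, not real values, so substitute the base URL and model names from your own console:

```python
import os

# Placeholder endpoints and model IDs -- use real values from your dashboards.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
    "digitalocean": {"base_url": "https://inference.example.do/v1", "model": "llama-3.3-70b"},
}

def inference_config(flag_env="INFERENCE_PROVIDER", default="openai"):
    """Resolve the active provider from a feature flag (env var)."""
    return PROVIDERS[os.environ.get(flag_env, default)]
```

Flipping `INFERENCE_PROVIDER=digitalocean` re-points every request; rolling back is unsetting the flag.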
OPEN AT EVERY LAYER
Image, video, speech, vision-language. Same API, same bill.
Stable Diffusion 3.5 for image. Wan 2.2 for video. Qwen3 TTS for speech. Nemotron and Kimi for vision-language. Plus the lifecycle around them that wrappers don't have: routing, evals, observability.
Among inference-only competitors, only Together ships full image/video/audio. Fireworks has no video. Baseten, Groq, DeepInfra have no multimodal.
Platform content guardrails on image and video by default — not opt-in.
Native multimodal generation, not a stitched chain of vendor APIs.
EVERY MODALITY, ONE PLATFORM
Nothing to provision. Nothing to size.
Inference runs only when you call it. Capacity scales automatically based on demand, so you don’t manage GPUs or plan for peak traffic.
Scale-to-zero by default
Automatic handling of traffic spikes
Pay only for active inference
01
You see exactly how inference behaves.
Serverless Inference includes observability and control primitives so you can understand and manage production workloads without adding external tooling.
Metrics for latency, tokens, errors, and spend
Request-level visibility across workloads
Built-in controls for rate limits and usage tracking
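To make the kind of control these primitives replace concrete, here is a minimal client-side token budget. It is illustrative only; on the platform, the equivalent limits and alerts are configured in the console, not in application code:

```python
class TokenBudget:
    """Minimal client-side spend guard (illustrative only)."""

    def __init__(self, max_tokens, price_per_1k_tokens):
        self.max_tokens = max_tokens
        self.price_per_1k = price_per_1k_tokens
        self.used = 0

    def record(self, tokens):
        """Record usage from a response; raise once the budget is exhausted."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(f"token budget exceeded: {self.used}/{self.max_tokens}")

    @property
    def spend(self):
        """Dollar spend so far at the configured per-1K-token rate."""
        return self.used / 1000 * self.price_per_1k
```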
02
Models are interchangeable.
Serverless Inference provides a unified API so teams can switch or experiment with models without changing application logic or rewriting integrations.
OpenAI- and Anthropic-compatible API
Consistent request and response format
Swap models without code changes
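Because the request shape is shared, a model swap touches one string. A sketch with illustrative model IDs (not a guaranteed catalog listing):

```python
def chat_payload(model, prompt):
    """Build an OpenAI-style chat completion payload; only `model` varies."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

# Same application logic, different model -- no other code changes.
a = chat_payload("deepseek-v3.2", "Summarize this incident report.")
b = chat_payload("qwen3.5-397b", "Summarize this incident report.")
```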
03
NO CARD · FREE UNTIL YOU MAKE A CALL · CANCEL ANY TIME
One rate card. 55+ models. No commits.
Pay per token, billed by the second of generation. Off-peak dynamic serverless inference pricing on MiniMax M2.5 (Public Preview) and Kimi K2.5 today, expanding across the catalog soon.
Cloud platforms with broad capability, more coordination required
Large cloud providers offer extensive infrastructure and model access within a unified environment. Teams often benefit from breadth and enterprise features, but deployments can involve multiple services, configuration layers, and procurement steps.
VS. HYPERSCALERS
Lightweight inference access without full infrastructure ownership
Some solutions provide streamlined access to models through a simple API layer. They reduce operational overhead but typically rely on external systems for storage, deployment, and production orchestration.
VS. WRAPPERS
Compute-first infrastructure for custom inference stacks
GPU-focused providers offer raw compute resources for teams that want full control over their inference stack. This approach enables flexibility but requires assembling and maintaining the surrounding system components.
VS. NEOCLOUDS
Fastest path to model access with additional production layers needed
Direct model endpoints provide immediate access to frontier models and are often used for initial development. Production applications usually require additional layers for routing, scaling, and operational management.
VS. DIRECT APIS
“In healthcare AI, a node going down isn’t just an SLA issue — it impacts patient experience. We’ve pressed DigitalOcean hard on reliability, access to the newest hardware, and the ability to scale efficiently. They’ve delivered.”
Debajyoti Datta
Co-Founder, Hippocratic AI
“Serverless Inference is fantastic because we can make as many calls as we need without worrying about provisioning infrastructure. It just scales automatically.”
Carlo Ruiz
Infrastructure Engineer, Traversal
Create a key.
Sign up with email or GitHub. Open the console and generate an API key in one click. Keys are scoped and can be rotated at any time.
01
Point your code.
Update your OpenAI- or Anthropic-compatible SDK to point to DigitalOcean. Swap the model name to a supported Serverless Inference model and start making requests.
02
Watch tokens, not nodes.
Track tokens, latency, errors, and spend directly in the console. Set usage limits and alerts in a few clicks.
03
Is there a free tier?
Yes. You can sign up without a credit card and start testing the API immediately. Pay-per-token pricing applies once you exceed the included usage. See the getting-started guide for details.
Q · 01
How do I migrate from OpenAI or Anthropic?
Update your base URL to the DigitalOcean endpoint and select a supported model. The OpenAI- and Anthropic-compatible SDKs continue to work, along with popular frameworks like LangChain and LlamaIndex. Most teams validate with a small workload before fully switching.
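Mechanically, the migration is a base-URL and model-name change on an otherwise identical OpenAI-style request. A stdlib sketch (the endpoint is a placeholder; copy the real base URL from your DigitalOcean console):

```python
import json
import urllib.request

BASE_URL = "https://inference.example.do/v1"  # placeholder endpoint

def build_chat_request(api_key, model, prompt):
    """Construct (but do not send) an OpenAI-compatible chat completion request."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

With the official SDKs, the same change is passing the new endpoint as the `base_url` argument when constructing the client.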
Q · 02
What happens when a frontier provider goes down?
Inference Router (Public Preview) can route requests across models based on policy, including fallback behavior when a provider is unavailable. This can help maintain continuity when individual model APIs experience disruptions.
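Outside the router, the same policy can be approximated client-side: try providers in priority order and return the first success. A minimal sketch, where the provider callables are stand-ins for real API clients:

```python
def call_with_fallback(prompt, providers):
    """providers: ordered list of (name, callable). Returns (name, response)
    from the first provider that succeeds; raises only if all fail."""
    failures = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, ...
            failures[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {failures}")
```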
Q · 03
When should I use Serverless vs. Dedicated?
Serverless is the starting point for most workloads — it's usage-based, scales automatically, and requires no provisioning. Dedicated Inference is designed for sustained, high-throughput production workloads. Both use the same API and billing system.
Q · 04
Can I bring my own model?
Dedicated Inference supports custom models today. Open-weight models such as DeepSeek, Qwen, and Llama are available out of the box on Serverless Inference, with additional customization options expanding over time.
Q · 05
Serverless Inference · Generally Available