55+ curated models
OpenAI- and Anthropic-compatible
VPC + zero data retention by default
Production AI runs on DigitalOcean
You pay for GPUs, not requests.
Inference rarely runs at steady state. Traffic spikes, then disappears, but capacity stays provisioned. Even with autoscaling, you’re still paying for idle infrastructure built for peak demand.
Shipping one model call turns into an entire platform.
What starts as a simple API request quickly expands into routing, retries, caching, observability, rate limits, and cost controls. By production, you’ve built an internal inference stack.
Model APIs look interchangeable until you try to switch them.
Every provider has different schemas, latency profiles, safety layers, and pricing behavior. Switching isn’t swapping endpoints — it’s reworking routing, evaluation, and application logic.
Pay per token. No GPU contracts. No minimums.
Forecasting your inference cost should look like forecasting your AWS bill. Batch runs at ~50% of real-time pricing. Off-peak dynamic pricing on MiniMax M2.5 and Kimi K2.5 today, expanding across the catalog.
$1M+ customer ARR up 179% YoY in Q1 2026.
>80% of AI customer ARR now from inference + core cloud, not bare metal.
Scale-to-zero on Serverless. Reserved capacity on Dedicated when you graduate.
PREDICTABLE AI ECONOMICS
If it can’t take real traffic, it doesn’t count.
Independently ranked, custom-kernel optimized, 55+ models behind one API. VPC, zero data retention, platform guardrails, and built-in observability ship as defaults — not enterprise add-ons.
#1 by Artificial Analysis on output speed for DeepSeek V3.2 and Qwen 3.5 397B.
230 tok/sec on DeepSeek V3.2 — 3.9× faster than AWS Bedrock.
180M+ patient interactions — Hippocratic AI runs clinical calls in production every day at 400ms.
PRODUCTION-GRADE BY DEFAULT
Bring your model. Keep your stack open.
Open-weight out of the box: DeepSeek, Qwen, Llama, Mixtral, Phi, gpt-oss. LoRA on Serverless lands Q2; full BYOM on Dedicated today. No proprietary lock-in.
Five integrated layers: compute, network, storage, data, AI — open at every one.
Messages API for Claude Code-compatible agentic workflows.
Drop-in OpenAI and Anthropic schemas. Migrate behind a feature flag, not a rewrite.
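In practice, "behind a feature flag" can be as small as an environment-driven config lookup. A minimal sketch; the endpoints and model IDs below are placeholders, not real values, so substitute the base URL and model names from your own console:

```python
import os

# Placeholder endpoints and model IDs -- use real values from your dashboards.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
    "digitalocean": {"base_url": "https://inference.example.do/v1", "model": "llama-3.3-70b"},
}

def inference_config(flag_env="INFERENCE_PROVIDER", default="openai"):
    """Resolve the active provider from a feature flag (env var)."""
    return PROVIDERS[os.environ.get(flag_env, default)]
```

Flipping `INFERENCE_PROVIDER=digitalocean` re-points every request; rolling back is unsetting the flag.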
OPEN AT EVERY LAYER
Image, video, speech, vision-language. Same API, same bill.
Stable Diffusion 3.5 for image. Wan 2.2 for video. Qwen3 TTS for speech. Nemotron and Kimi for vision-language. Plus the lifecycle around them that wrappers don't have: routing, evals, observability.
Among inference-only competitors, only Together ships full image/video/audio. Fireworks has no video. Baseten, Groq, DeepInfra have no multimodal.
Platform content guardrails on image and video by default — not opt-in.
Native multimodal generation, not a stitched chain of vendor APIs.
EVERY MODALITY, ONE PLATFORM
Nothing to provision. Nothing to size.
Inference runs only when you call it. Capacity scales automatically based on demand, so you don’t manage GPUs or plan for peak traffic.
Scale-to-zero by default
Automatic handling of traffic spikes
Pay only for active inference
01
You see exactly how inference behaves.
Serverless Inference includes observability and control primitives so you can understand and manage production workloads without adding external tooling.
Metrics for latency, tokens, errors, and spend
Request-level visibility across workloads
Built-in controls for rate limits and usage tracking
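To make the kind of control these primitives replace concrete, here is a minimal client-side token budget. It is illustrative only; on the platform, the equivalent limits and alerts are configured in the console, not in application code:

```python
class TokenBudget:
    """Minimal client-side spend guard (illustrative only)."""

    def __init__(self, max_tokens, price_per_1k_tokens):
        self.max_tokens = max_tokens
        self.price_per_1k = price_per_1k_tokens
        self.used = 0

    def record(self, tokens):
        """Record usage from a response; raise once the budget is exhausted."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(f"token budget exceeded: {self.used}/{self.max_tokens}")

    @property
    def spend(self):
        """Dollar spend so far at the configured per-1K-token rate."""
        return self.used / 1000 * self.price_per_1k
```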
02
Models are interchangeable.
Serverless Inference provides a unified API so teams can switch or experiment with models without changing application logic or rewriting integrations.
OpenAI- and Anthropic-compatible API
Consistent request and response format
Swap models without code changes
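Because the request shape is shared, a model swap touches one string. A sketch with illustrative model IDs (not a guaranteed catalog listing):

```python
def chat_payload(model, prompt):
    """Build an OpenAI-style chat completion payload; only `model` varies."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

# Same application logic, different model -- no other code changes.
a = chat_payload("deepseek-v3.2", "Summarize this incident report.")
b = chat_payload("qwen3.5-397b", "Summarize this incident report.")
```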
03
NO CARD · FREE UNTIL YOU MAKE A CALL · CANCEL ANY TIME
One rate card. 55+ models. No commits.
Pay per token, billed by the second of generation. Off-peak dynamic serverless inference pricing on MiniMax M2.5 (Public Preview) and Kimi K2.5 today, expanding across the catalog soon.
Cloud platforms with broad capability, more coordination required
Large cloud providers offer extensive infrastructure and model access within a unified environment. Teams often benefit from breadth and enterprise features, but deployments can involve multiple services, configuration layers, and procurement steps.
VS. HYPERSCALERS
Lightweight inference access without full infrastructure ownership
Some solutions provide streamlined access to models through a simple API layer. They reduce operational overhead but typically rely on external systems for storage, deployment, and production orchestration.
VS. WRAPPERS
Compute-first infrastructure for custom inference stacks
GPU-focused providers offer raw compute resources for teams that want full control over their inference stack. This approach enables flexibility but requires assembling and maintaining the surrounding system components.
VS. NEOCLOUDS
Fastest path to model access with additional production layers needed
Direct model endpoints provide immediate access to frontier models and are often used for initial development. Production applications usually require additional layers for routing, scaling, and operational management.
VS. DIRECT APIS
“In healthcare AI, a node going down isn’t just an SLA issue — it impacts patient experience. We’ve pressed DigitalOcean hard on reliability, access to the newest hardware, and the ability to scale efficiently. They’ve delivered.”
Debajyoti Datta
Co-Founder, Hippocratic AI
“Serverless Inference is fantastic because we can make as many calls as we need without worrying about provisioning infrastructure. It just scales automatically.”
Carlo Ruiz
Infrastructure Engineer, Traversal
Create a key.
Sign up with email or GitHub. Open the console and generate an API key in one click. Keys are scoped and can be rotated at any time.
01
Point your code.
Update your OpenAI- or Anthropic-compatible SDK to point to DigitalOcean. Swap the model name to a supported Serverless Inference model and start making requests.
02
Watch tokens, not nodes.
Track tokens, latency, errors, and spend directly in the console. Set usage limits and alerts in a few clicks.
03
Is there a free tier?
Yes. You can sign up without a credit card and start testing the API immediately. Pay-per-token pricing applies once you exceed the included usage. See the getting-started guide for details.
Q · 01
How do I migrate from OpenAI or Anthropic?
Update your base URL to the DigitalOcean endpoint and select a supported model. The OpenAI- and Anthropic-compatible SDKs continue to work, along with popular frameworks like LangChain and LlamaIndex. Most teams validate with a small workload before fully switching.
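Mechanically, the migration is a base-URL and model-name change on an otherwise identical OpenAI-style request. A stdlib sketch (the endpoint is a placeholder; copy the real base URL from your DigitalOcean console):

```python
import json
import urllib.request

BASE_URL = "https://inference.example.do/v1"  # placeholder endpoint

def build_chat_request(api_key, model, prompt):
    """Construct (but do not send) an OpenAI-compatible chat completion request."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

With the official SDKs, the same change is passing the new endpoint as the `base_url` argument when constructing the client.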
Q · 02
What happens when a frontier provider goes down?
Inference Router (Public Preview) can route requests across models based on policy, including fallback behavior when a provider is unavailable. This can help maintain continuity when individual model APIs experience disruptions.
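Outside the router, the same policy can be approximated client-side: try providers in priority order and return the first success. A minimal sketch, where the provider callables are stand-ins for real API clients:

```python
def call_with_fallback(prompt, providers):
    """providers: ordered list of (name, callable). Returns (name, response)
    from the first provider that succeeds; raises only if all fail."""
    failures = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, ...
            failures[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {failures}")
```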
Q · 03
When should I use Serverless vs. Dedicated?
Serverless is the starting point for most workloads — it's usage-based, scales automatically, and requires no provisioning. Dedicated Inference is designed for sustained, high-throughput production workloads. Both use the same API and billing system.
Q · 04
Can I bring my own model?
Dedicated Inference supports custom models today. Open-weight models such as DeepSeek, Qwen, and Llama are available out of the box on Serverless Inference, with additional customization options expanding over time.
Q · 05
Serverless Inference · Generally Available