v4.2 · H100 + B200 pools live in 6 regions

Serve open models at scale without renting a fleet you can’t fill.

Tensorlane packs your inference onto spot and reserved GPUs, autoscales by token-second, and holds P95 under 240ms at 90% utilization. You ship a model; we keep the silicon busy.

$ tl deploy llama-3.3-70b --quant fp8 --autoscale
router · prod-us-east-2
routing
Tokens / sec · all pools: 1,284,602 (+18.4% vs 1h ago)
Model · pool              P95    Util  GPU
llama-3.3-70b · fp8       186ms  93%   H100
mixtral-8x22b · awq       231ms  89%   H100
qwen2.5-72b · fp8         204ms  91%   B200
whisper-large-v3          78ms   84%   L40S
bge-m3 · embed            14ms   61%   A10
Last 24h · global · live
Median P95: 214 ms (−11ms vs Tue)
Fleet utilization: 90.6% (+4.2pp vs Tue)
Cost / 1M tokens: $0.38 (−$0.06 vs Tue)
Cold-start P99: 3.9 s (+0.3s vs Tue)
Spot reclaim handled · 0 dropped

412 spot interruptions drained and rerouted in the last day — zero requests failed, zero pages.

Inference running in production at

Halcyon Northwind Vantage Brightwave Strata Cinder Lattice Stratos

Benchmarks

The numbers we publish, measured on your traffic shape.

Every figure below is from a 7-day replay of a mixed chat + RAG workload at 2,400 req/s. We re-run it nightly against the live fleet and post the regression diff.

Throughput · 70B fp8 · 2.1×
3,940 tok/s/GPU

Continuous batching + paged KV cache. 2.1× the tok/s of a naive vLLM deploy on the same H100.

Time to first token · P95
92 ms
P50 41ms · P90 78ms · P95 92ms · P99 141ms

Speculative decoding plus a warm pool keep TTFT flat through autoscale events.

Cost at 90% util · −47%
$0.38 / 1M
Reserved-only baseline: $0.72
Tensorlane spot blend: $0.38

Spot-aware scheduler keeps interruptible silicon at 64% of the blend without touching your SLA.
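
The −47% figure can be checked from the two published prices. The sketch below does that arithmetic. It also shows one plausible way a blended price could be computed. That part is an assumption: it treats the blend as a token-weighted average of a spot rate and a reserved rate, which the page does not state, and the function names are illustrative.

```python
def savings_pct(baseline: float, blended: float) -> float:
    """Percent saved relative to the baseline price per 1M tokens."""
    return (baseline - blended) / baseline * 100


def blended_price(spot_price: float, reserved_price: float, spot_share: float) -> float:
    """Token-weighted average price (assumed model, not a published formula).

    spot_share is the fraction of tokens served on spot capacity (0.0 to 1.0).
    """
    return spot_share * spot_price + (1 - spot_share) * reserved_price


# Published figures: $0.72 reserved-only baseline vs. $0.38 spot blend.
print(f"{savings_pct(0.72, 0.38):.1f}%")  # 47.2%
```

With a spot share of zero, `blended_price` returns the reserved price unchanged, which is a quick sanity check on the weighting.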

Platform

One control plane from weights to wire.

Routing, quantization, autoscaling, observability, and spend governance — the pieces you would otherwise stitch from five tools.

Adaptive router

Route every request to its cheapest healthy GPU

The scheduler reads live queue depth, KV-cache pressure, and spot reclaim signals, then places each request on the pool that meets your P95 for the least money. Reclaims drain in flight.

route req_8f21c · class=chat · ctx=4096
· candidates: H100-east(93%) B200-west(78%) A100-east(61%)
· picked A100-east · est P95 198ms · $0.31/1M
placed in 0.8ms · queue depth 2
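
The placement rule described above (cheapest healthy pool that still meets the P95 target) can be sketched in a few lines. This is a minimal illustration, not Tensorlane's scheduler: the `Pool` fields and `pick_pool` are hypothetical names, and only the A100-east latency and price come from the trace above. The other figures are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pool:
    name: str
    healthy: bool        # False while draining, e.g. after a spot reclaim notice
    est_p95_ms: float    # estimated P95 if this request lands here
    cost_per_1m: float   # dollars per 1M tokens


def pick_pool(pools: List[Pool], p95_target_ms: float) -> Optional[Pool]:
    """Return the cheapest healthy pool whose estimated P95 meets the target."""
    eligible = [p for p in pools if p.healthy and p.est_p95_ms <= p95_target_ms]
    if not eligible:
        return None  # caller decides whether to queue, degrade, or reject
    # Cheapest first; break price ties by lower latency.
    return min(eligible, key=lambda p: (p.cost_per_1m, p.est_p95_ms))


candidates = [
    Pool("H100-east", True, 170.0, 0.52),  # illustrative numbers
    Pool("B200-west", True, 150.0, 0.61),  # illustrative numbers
    Pool("A100-east", True, 198.0, 0.31),  # from the trace above
]
choice = pick_pool(candidates, p95_target_ms=240.0)
print(choice.name if choice else "no eligible pool")  # A100-east
```

A real scheduler would derive `est_p95_ms` from queue depth and KV-cache pressure. Here it is simply an input.
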
Quantization

fp8 / awq / gptq, picked for you

We benchmark each format against your eval set and deploy the one that holds accuracy within 0.4 points at the best tok/s.

fp8 (default) · awq-int4 · gptq-int8
Observability

Per-token traces, OTel-native

Every span — admission, prefill, decode, post — exported to your collector. Latency budgets alert before they breach.

Autoscale

Scale to zero between bursts

Warm pools hold a floor; everything above scales on token-second demand and parks at zero when idle.

Spend caps

Hard budgets per project

Set a monthly ceiling and a degrade policy. Past the cap we shed to a cheaper pool instead of paging you.

Compliance

SOC 2 · in-region inference

Pin a model to eu-central or us-east only. Logs stay in-region; weights never leave the VPC you choose.

Dashboard preview (tabs: Latency, Cost, Errors, Regions; 7d window; series: Tensorlane vs. Self-managed)
Req / s: 2,418 · P95: 214ms · Error rate: 0.014%

Single pane

Watch latency, cost, and errors on one timeline

Stop correlating Grafana, your cloud bill, and a log search by hand. Tensorlane stitches every signal to the same request id, so a P99 spike, the pool it landed on, and what it cost are one click apart.

  • Replay any 5-minute window against a candidate config before you ship it.
  • Per-token cost attribution down to the customer and the route.
  • Alert on a latency budget, not a raw threshold that drifts with load.
Read the routing guide

Pricing

Pay for tokens, not idle GPUs

Token-second metering on every tier. No minimum commit on Sandbox; reserve capacity when you need a floor.

Sandbox

shared pool
$0
+ $0.55 / 1M tokens

Kick the tires on shared H100s with a real model and real traces.

  • Any open-weight model, shared GPUs
  • 5 req/s soft cap
  • 7-day metric retention
  • Community Slack support
Start free

Most popular

Scale

spot blend
$0.38
/ 1M tokens · 70B fp8

Dedicated routing across spot + reserved with a P95 SLA.

  • Adaptive router + autoscale to zero
  • SLA: P95 < 240ms, 99.9% uptime
  • OTel traces + 90-day retention
  • Per-project spend caps
  • Email + priority Slack
Start 14-day trial

Reserved

committed GPUs
Custom
annual capacity reservation

A floor of dedicated silicon, in your region, with a named TAM.

  • Reserved H100 / B200 capacity
  • In-region inference + VPC peering
  • Custom P95 SLA + on-call
  • SOC 2 report + DPA + audit log
Talk to engineering

Customers

Teams that moved off a hand-rolled fleet

“We cut our inference bill 44% the week we cut over, and our P95 actually dropped. The spot-reclaim handling is the part I can’t replicate in-house.”
Hana Suzuki
ML Lead · Brightwave

“The per-token cost attribution settled a six-month argument with finance in an afternoon. I can finally show which customer is expensive and why.”
Kenji Mori
Platform Lead · Northwind

“We deploy a new fine-tune every Tuesday. One push, the router warms a pool, and the old version drains with zero dropped requests. It just works.”
Yusuf Abara
CTO · Stratos

FAQ

Questions an infra lead asks

If yours isn’t here, the docs go deeper — or page an engineer in the shared Slack.

How do you keep my SLA when a spot pool gets reclaimed?

Reclaim notices arrive ~120s ahead. The router stops admitting new requests to that pool, migrates KV state, and lets in-flight decodes finish on the warm reserved floor. Across 412 reclaims yesterday, zero requests failed.

Which quantization do you run, and how much accuracy do I lose?

Default is fp8 with per-channel scaling. We replay your eval set against fp8, awq-int4, and gptq-int8 and deploy the format that holds within 0.4 points of bf16. You can pin a format if you’d rather decide.
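
The selection rule in that answer can be sketched as a filter followed by a max. This is an illustration under stated assumptions: `QuantResult` and `pick_format` are hypothetical names, and every score and throughput number below is invented for the example rather than a published benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QuantResult:
    fmt: str           # "fp8", "awq-int4", or "gptq-int8"
    eval_score: float  # score on the customer's eval set, in points
    tok_per_s: float   # measured throughput per GPU


def pick_format(results: List[QuantResult], bf16_score: float,
                max_drop: float = 0.4) -> Optional[QuantResult]:
    """Fastest format whose eval score is within max_drop points of bf16."""
    within = [r for r in results if bf16_score - r.eval_score <= max_drop]
    if not within:
        return None  # nothing holds accuracy; a real system might fall back to bf16
    return max(within, key=lambda r: r.tok_per_s)


results = [
    QuantResult("fp8", 70.8, 3900.0),       # hypothetical numbers
    QuantResult("awq-int4", 70.1, 4600.0),  # fastest, but drops 0.9 points
    QuantResult("gptq-int8", 70.7, 3300.0),
]
best = pick_format(results, bf16_score=71.0)
print(best.fmt if best else "none")  # fp8
```

In this made-up run, awq-int4 is the fastest format but loses too much accuracy, so fp8 wins.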

Can I bring my own weights and a custom inference image?

Yes. Push a safetensors checkpoint or point us at a private HF repo, and optionally supply an OCI image with your engine. We validate it against the router contract, then schedule it like any first-party model.

How is a token-second metered for billing?

We bill prefill and decode tokens separately at the GPU-class rate, summed per request and rounded to the microcent. Idle warm-pool time on Scale is included; you only pay for tokens that move.
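
As a rough sketch of that metering rule, the function below prices prefill and decode tokens at separate rates, sums them for one request, and rounds to the microcent (1e-8 dollars). The rates are placeholders, not published prices, and the round-half-up mode is an assumption, since the answer above does not say which rounding mode applies.

```python
from decimal import Decimal, ROUND_HALF_UP

MICROCENT = Decimal("0.00000001")  # one microcent expressed in dollars
TOKENS_PER_UNIT = Decimal(1_000_000)


def request_cost(prefill_tokens: int, decode_tokens: int,
                 prefill_rate_per_1m: Decimal,
                 decode_rate_per_1m: Decimal) -> Decimal:
    """Dollar cost of one request, rounded to the nearest microcent."""
    raw = (Decimal(prefill_tokens) * prefill_rate_per_1m
           + Decimal(decode_tokens) * decode_rate_per_1m) / TOKENS_PER_UNIT
    # Rounding mode is an assumption; swap in whatever the billing contract states.
    return raw.quantize(MICROCENT, rounding=ROUND_HALF_UP)


# Placeholder rates for one GPU class: $0.20 / 1M prefill, $0.60 / 1M decode.
cost = request_cost(4096, 512, Decimal("0.20"), Decimal("0.60"))
print(cost)  # 0.00112640
```

Using `Decimal` rather than floats keeps the per-request sums exact before the single rounding step.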

Will my inference and logs stay in one region?

Pin a model to a region set (e.g. eu-central only) and the scheduler never places it elsewhere, even under load. Traces and prompts are stored in-region; weights live inside the VPC you nominate.
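
Region pinning amounts to a hard filter applied before any cost or latency ranking. A minimal sketch follows. The pool names are hypothetical, and only the region labels eu-central and us-east appear on this page.

```python
from typing import Iterable, List, Set, Tuple


def allowed_pools(pools: Iterable[Tuple[str, str]],
                  pinned_regions: Set[str]) -> List[str]:
    """Keep only pools whose region is in the pinned set.

    pools is an iterable of (pool_name, region) pairs. An empty result means
    the request waits or is rejected; it never spills to another region.
    """
    return [name for name, region in pools if region in pinned_regions]


pools = [
    ("h100-eu-central-a", "eu-central"),  # hypothetical pool names
    ("h100-us-east-a", "us-east"),
    ("b200-us-west-a", "us-west"),
]
print(allowed_pools(pools, {"eu-central"}))  # ['h100-eu-central-a']
```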

Deploy your first model in the next ten minutes

No commit, no fleet to provision. Push a checkpoint, get a streaming endpoint, watch the traces light up.

$ curl -fsSL tensorlane.dev/install | sh