Tensorlane — GPU inference infrastructure for production models

Benchmarks

The numbers we publish, measured on your traffic shape.

Every figure below is from a 7-day replay of a mixed chat + RAG workload at 2,400 req/s. We re-run it nightly against the live fleet and post the regression diff.

Throughput · 70B fp8 +2.1×

3,940 tok/s/GPU

Continuous batching + paged KV cache. 2.1× the tok/s of a naive vLLM deploy on the same H100.

Time to first token · P95

92 ms

P50 41 ms · P90 78 ms · P95 92 ms · P99 141 ms

Speculative decoding plus a warm pool keep TTFT flat through autoscale events.

Cost at 90% util −47%

$0.38 / 1M

Reserved-only baseline: $0.72

Tensorlane spot blend: $0.38

Spot-aware scheduler keeps interruptible silicon at 64% of the blend without touching your SLA.
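As a back-of-envelope check on the figures above, the sketch below recomputes the saving and backs out the spot rate the blend would imply. It assumes the reserved share of the blend is billed at the $0.72 baseline and that 64% is a share of tokens served; the page states neither, and the function names are illustrative.

```python
def saving_pct(baseline: float, blended: float) -> float:
    """Percentage change from baseline to blended price (negative means cheaper)."""
    return (blended - baseline) / baseline * 100.0


def implied_spot_rate(blended: float, reserved: float, spot_share: float) -> float:
    """Solve blended = spot_share * spot + (1 - spot_share) * reserved for spot."""
    return (blended - (1.0 - spot_share) * reserved) / spot_share


RESERVED = 0.72    # $ per 1M tokens, reserved-only baseline (from the page)
BLENDED = 0.38     # $ per 1M tokens, Tensorlane spot blend (from the page)
SPOT_SHARE = 0.64  # interruptible share of the blend (from the page)

print(f"saving: {saving_pct(RESERVED, BLENDED):.1f}%")
print(f"implied spot rate: ${implied_spot_rate(BLENDED, RESERVED, SPOT_SHARE):.2f} / 1M")
```

Under those assumptions the saving works out to about 47%, matching the headline, and the spot capacity would need to cost roughly $0.19 per 1M tokens.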

Platform

One control plane from weights to wire.

Routing, quantization, autoscaling, observability, and spend governance — the pieces you would otherwise stitch from five tools.

Adaptive router

Route every request to its cheapest healthy GPU

The scheduler reads live queue depth, KV-cache pressure, and spot reclaim signals, then places each request on the pool that meets your P95 for the least money. When a spot pool is reclaimed, its in-flight requests drain instead of dropping.

→route req_8f21c · class=chat · ctx=4096

·candidates: H100-east(93%) B200-west(78%) A100-east(61%)

·picked A100-east · est P95 198ms · $0.31/1M

✓placed in 0.8ms · queue depth 2
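The placement rule in that trace can be sketched as "cheapest healthy pool that still meets the P95 target". This is a minimal illustration, not Tensorlane's scheduler: the `Pool` type and `pick_pool` function are hypothetical, the A100-east numbers come from the trace, the other two pools' latency and price are made up, and the 240 ms target is borrowed from the Scale tier SLA.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pool:
    name: str
    est_p95_ms: float     # estimated P95 latency if the request lands here
    usd_per_1m: float     # current price per 1M tokens on this pool
    healthy: bool = True  # False while draining, e.g. after a spot reclaim notice


def pick_pool(pools: Sequence[Pool], p95_target_ms: float) -> Optional[Pool]:
    """Return the cheapest healthy pool whose estimated P95 meets the target."""
    eligible = [p for p in pools if p.healthy and p.est_p95_ms <= p95_target_ms]
    if not eligible:
        return None  # caller decides: queue, shed, or relax the target
    # Cheapest first; break price ties by lower estimated latency.
    return min(eligible, key=lambda p: (p.usd_per_1m, p.est_p95_ms))


pools = [
    Pool("H100-east", est_p95_ms=120.0, usd_per_1m=0.52),  # illustrative values
    Pool("B200-west", est_p95_ms=95.0, usd_per_1m=0.61),   # illustrative values
    Pool("A100-east", est_p95_ms=198.0, usd_per_1m=0.31),  # values from the trace
]
choice = pick_pool(pools, p95_target_ms=240.0)
print(choice.name if choice else "no pool meets the target")
```

With a 240 ms target the sketch lands on A100-east, the same pool the trace picked; tighten the target to 100 ms and only B200-west qualifies.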

Quantization

fp8 / awq / gptq, picked for you

We benchmark each format against your eval set and deploy the one that holds accuracy within 0.4 points at the best tok/s.

fp8 · default awq-int4 gptq-int8

Observability

Per-token traces, OTel-native

Every span — admission, prefill, decode, post — exported to your collector. Latency budgets alert before they breach.

Autoscale

Scale to zero between bursts

Warm pools hold a floor; everything above scales on token-second demand and parks at zero when idle.
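One way to read that policy: the warm floor is always on, and burst replicas are whatever demand needs beyond it. The sketch below assumes demand is measured in tokens per second and uses the 3,940 tok/s/GPU benchmark figure as per-replica capacity; `burst_replicas` and its parameters are hypothetical names, not a Tensorlane API.

```python
import math


def burst_replicas(demand_tok_s: float, per_replica_tok_s: float, warm_floor: int) -> int:
    """Replicas to run above the warm floor for the given token demand."""
    if per_replica_tok_s <= 0:
        raise ValueError("per_replica_tok_s must be positive")
    if demand_tok_s <= 0:
        return 0  # idle: burst capacity parks at zero
    needed = math.ceil(demand_tok_s / per_replica_tok_s)
    # The warm floor absorbs the first `warm_floor` replicas' worth of demand.
    return max(0, needed - warm_floor)


for demand in (0, 5_000, 20_000):
    extra = burst_replicas(demand, per_replica_tok_s=3_940, warm_floor=2)
    print(f"{demand} tok/s -> {extra} burst replicas")
```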

Spend caps

Hard budgets per project

Set a monthly ceiling and a degrade policy. Past the cap we shed to a cheaper pool instead of paging you.
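A degrade policy of this kind might look like the sketch below: once month-to-date spend reaches the ceiling, requests go to the cheaper pool. The `SpendPolicy` fields and pool names are assumptions for illustration; the page does not describe the actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpendPolicy:
    monthly_cap_usd: float
    primary_pool: str
    degrade_pool: str  # cheaper pool used once the cap is reached


def pool_for_request(policy: SpendPolicy, month_to_date_usd: float) -> str:
    """Pick the pool for the next request under a hard monthly budget."""
    if month_to_date_usd >= policy.monthly_cap_usd:
        return policy.degrade_pool  # past the cap: shed to the cheaper pool
    return policy.primary_pool


policy = SpendPolicy(monthly_cap_usd=5_000.0, primary_pool="H100-east", degrade_pool="A100-east")
print(pool_for_request(policy, month_to_date_usd=4_200.0))
print(pool_for_request(policy, month_to_date_usd=5_000.0))
```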

Compliance

SOC 2 · in-region inference

Pin a model to eu-central or us-east only. Logs stay in-region; weights never leave the VPC you choose.

Dashboard (Latency, Cost, Errors, Regions; 7d): 2,418 req/s · P95 214ms · error rate 0.014% · Tensorlane vs. self-managed

Single pane

Watch latency, cost, and errors on one timeline

Stop correlating Grafana, your cloud bill, and a log search by hand. Tensorlane stitches every signal to the same request id, so a P99 spike, the pool it landed on, and what it cost are one click apart.

Replay any 5-minute window against a candidate config before you ship it.
Per-token cost attribution down to the customer and the route.
Alert on a latency budget, not a raw threshold that drifts with load.

Read the routing guide

Pricing

Pay for tokens, not idle GPUs

Token-second metering on every tier. No minimum commit on Sandbox; reserve capacity when you need a floor.

Sandbox

shared pool

+ $0.55 / 1M tokens

Kick the tires on shared H100s with a real model and real traces.

Any open-weight model, shared GPUs
5 req/s soft cap
7-day metric retention
Community Slack support

Start free

Scale

spot blend

$0.38

/ 1M tokens · 70B fp8

Dedicated routing across spot + reserved with a P95 SLA.

Adaptive router + autoscale to zero
SLA: P95 < 240ms, 99.9% uptime
OTel traces + 90-day retention
Per-project spend caps
Email + priority Slack

Start 14-day trial

Reserved

committed GPUs

Custom

annual capacity reservation

A floor of dedicated silicon, in your region, with a named TAM.

Reserved H100 / B200 capacity
In-region inference + VPC peering
Custom P95 SLA + on-call
SOC 2 report + DPA + audit log

Talk to engineering

Customers

Teams that moved off a hand-rolled fleet

“We cut our inference bill 44% the week we cut over, and our P95 actually dropped. The spot-reclaim handling is the part I can’t replicate in-house.”

Hana Suzuki

ML Lead · Brightwave

“The per-token cost attribution settled a six-month argument with finance in an afternoon. I can finally show which customer is expensive and why.”

Kenji Mori

Platform Lead · Northwind

“We deploy a new fine-tune every Tuesday. One push, the router warms a pool, and the old version drains with zero dropped requests. It just works.”

Yusuf Abara

CTO · Stratos

FAQ

Questions an infra lead asks

If yours isn’t here, the docs go deeper — or page an engineer in the shared Slack.

How do you keep my SLA when a spot pool gets reclaimed?

Reclaim notices arrive about 120 seconds ahead. The router stops admitting new requests to that pool, migrates KV state to the warm reserved floor, and lets in-flight decodes finish there. Across 412 reclaims yesterday, zero requests failed.
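The drain sequence can be sketched as a small state change on two pools. This is a simplified model under stated assumptions: `DrainablePool` and `handle_reclaim` are hypothetical names, requests are represented by their ids only, and the actual KV-state transfer is reduced to moving an id from one list to the other.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrainablePool:
    name: str
    admitting: bool = True
    in_flight: List[str] = field(default_factory=list)  # request ids still decoding


def handle_reclaim(spot: DrainablePool, reserved: DrainablePool) -> List[str]:
    """Drain a spot pool after a reclaim notice; return the migrated request ids."""
    spot.admitting = False               # 1. stop admitting new requests
    migrated = list(spot.in_flight)      # 2. snapshot what is still decoding
    reserved.in_flight.extend(migrated)  # 3. hand off to the reserved floor (stands in for KV migration)
    spot.in_flight.clear()               # 4. the spot pool is now safe to lose
    return migrated


spot = DrainablePool("spot-east", in_flight=["req_a", "req_b"])
reserved = DrainablePool("reserved-east", in_flight=["req_c"])
moved = handle_reclaim(spot, reserved)
print(moved, spot.admitting, reserved.in_flight)
```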

Which quantization do you run, and how much accuracy do I lose?

Default is fp8 with per-channel scaling. We replay your eval set against fp8, awq-int4, and gptq-int8 and deploy the format that holds within 0.4 points of bf16. You can pin a format if you’d rather decide.
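The selection rule reads as "fastest format that stays within 0.4 points of the bf16 score". A minimal sketch, assuming accuracy is a single eval score in points and that only drops below the baseline count against a format; the scores and throughputs in the example are invented for illustration, and `pick_format` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple


def pick_format(results: Dict[str, Tuple[float, float]],
                baseline_acc: float,
                max_drop: float = 0.4) -> Optional[str]:
    """results maps format -> (accuracy in points, tok/s).

    Return the highest-throughput format whose accuracy is no more than
    max_drop points below the baseline, or None if no format qualifies.
    """
    eps = 1e-9  # tolerance so a drop of exactly max_drop survives float rounding
    within = {fmt: tok_s for fmt, (acc, tok_s) in results.items()
              if baseline_acc - acc <= max_drop + eps}
    if not within:
        return None
    return max(within, key=within.get)


# Illustrative numbers only; a real run would use your eval set.
results = {
    "fp8": (71.0, 3940.0),
    "awq-int4": (70.5, 4600.0),
    "gptq-int8": (70.9, 3500.0),
}
print(pick_format(results, baseline_acc=71.2))
```

In this made-up example awq-int4 is fastest but drops 0.7 points, so fp8 is the fastest format that qualifies.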

Can I bring my own weights and a custom inference image?

Yes. Push a safetensors checkpoint or point us at a private HF repo, and optionally supply an OCI image with your engine. We validate it against the router contract, then schedule it like any first-party model.

How is a token-second metered for billing?

We bill prefill and decode tokens separately at the GPU-class rate, summed per request and rounded to the microcent. Idle warm-pool time on Scale is included; you only pay for tokens that move.
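As a worked example of that metering, the sketch below prices one request with separate prefill and decode rates and rounds to the microcent (1e-8 USD). The rates are placeholders rather than published prices, and half-up rounding is an assumption; the page does not say which rounding mode applies.

```python
from decimal import Decimal, ROUND_HALF_UP

MICROCENT = Decimal("0.00000001")  # one microcent expressed in USD
TOKENS_PER_UNIT = Decimal(1_000_000)  # rates are quoted per 1M tokens


def request_cost_usd(prefill_tokens: int, decode_tokens: int,
                     prefill_rate_per_1m: Decimal,
                     decode_rate_per_1m: Decimal) -> Decimal:
    """Cost of one request: prefill and decode priced separately, then summed."""
    raw = (prefill_tokens * prefill_rate_per_1m
           + decode_tokens * decode_rate_per_1m) / TOKENS_PER_UNIT
    return raw.quantize(MICROCENT, rounding=ROUND_HALF_UP)


# Hypothetical rates for one GPU class: $0.20 / 1M prefill, $0.60 / 1M decode.
cost = request_cost_usd(4096, 512, Decimal("0.20"), Decimal("0.60"))
print(f"${cost}")
```

Using `Decimal` keeps the per-request sum exact before rounding, which matters when millions of sub-cent charges are added up on an invoice.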

Will my inference and logs stay in one region?

Pin a model to a region set (e.g. eu-central only) and the scheduler never places it elsewhere, even under load. Traces and prompts are stored in-region; weights live inside the VPC you nominate.

Serve open models at scale without renting a fleet you can’t fill.