Tensorlane packs your inference onto spot and reserved GPUs, autoscales by token-second, and holds P95 under 240ms at 90% utilization. You ship a model; we keep the silicon busy.
412 spot interruptions drained and rerouted in the last day — zero requests failed, zero pages.
Inference running in production at
Benchmarks
Every figure below is from a 7-day replay of a mixed chat + RAG workload at 2,400 req/s. We re-run it nightly against the live fleet and post the regression diff.
Continuous batching + paged KV cache. 2.1× the tok/s of a naive vLLM deploy on the same H100.
Speculative decoding plus a warm pool keep TTFT flat through autoscale events.
Spot-aware scheduler keeps interruptible silicon at 64% of the blend without touching your SLA.
Platform
Routing, quantization, autoscaling, observability, and spend governance — the pieces you would otherwise stitch from five tools.
The scheduler reads live queue depth, KV-cache pressure, and spot reclaim signals, then places each request on the pool that meets your P95 for the least money. When a spot node is reclaimed, its in-flight requests drain instead of failing.
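The placement rule above can be sketched in a few lines. This is an illustration of the idea, not Tensorlane's scheduler: the `PlacementPool` fields, the `place` function, and the assumption that a single predicted-P95 number already folds in queue depth and KV-cache pressure are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlacementPool:
    """Hypothetical snapshot of one GPU pool at placement time."""
    name: str
    predicted_p95_ms: float       # assumed estimate derived from queue depth and KV-cache pressure
    cost_per_token_second: float  # assumed price for this pool
    reclaim_pending: bool         # True once a spot reclaim notice has arrived


def place(pools: list[PlacementPool], p95_target_ms: float) -> Optional[PlacementPool]:
    """Return the cheapest pool predicted to meet the P95 target, or None."""
    eligible = [
        p for p in pools
        if not p.reclaim_pending and p.predicted_p95_ms <= p95_target_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p.cost_per_token_second)


pools = [
    PlacementPool("spot-a", 180.0, 0.4, False),
    PlacementPool("spot-b", 150.0, 0.3, True),    # cheaper, but being reclaimed
    PlacementPool("reserved", 120.0, 1.0, False),
    PlacementPool("spot-c", 300.0, 0.2, False),   # cheapest, but misses the target
]
chosen = place(pools, p95_target_ms=240.0)
```

With these made-up numbers the request lands on `spot-a`: it is the cheapest pool that is both under the 240 ms target and not draining.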
We benchmark each quantization format against your eval set and deploy the one that holds accuracy within 0.4 points at the best tok/s.
Every span (admission, prefill, decode, post) is exported to your collector. Latency budgets alert before they breach.
Warm pools hold a floor; everything above scales on token-second demand and parks at zero when idle.
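As a rough sketch of the floor-plus-burst idea, assume demand and per-replica capacity are expressed in the same token-second unit. The function name, parameters, and the simple ceiling-division sizing are assumptions for illustration only.

```python
import math


def target_replicas(demand: float, capacity_per_replica: float, warm_floor: int) -> tuple[int, int]:
    """Return (warm replicas, burst replicas) for the current demand.

    The warm floor is always held; burst replicas cover only the demand
    the floor cannot absorb and drop to zero when that demand goes away.
    """
    needed = math.ceil(demand / capacity_per_replica)
    burst = max(0, needed - warm_floor)
    return warm_floor, burst


idle = target_replicas(demand=0, capacity_per_replica=1000, warm_floor=2)
busy = target_replicas(demand=4500, capacity_per_replica=1000, warm_floor=2)
```

Here `idle` keeps only the two warm replicas, while `busy` needs five replicas in total, so three are added as burst capacity.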
Set a monthly ceiling and a degrade policy. Past the cap we shed to a cheaper pool instead of paging you.
Pin a model to eu-central or us-east only. Logs stay in-region; weights never leave the VPC you choose.
Single pane
Stop correlating Grafana, your cloud bill, and a log search by hand. Tensorlane stitches every signal to the same request id, so a P99 spike, the pool it landed on, and what it cost are one click apart.
Pricing
Token-second metering on every tier. No minimum commit on Sandbox; reserve capacity when you need a floor.
Kick the tires on shared H100s with a real model and real traces.
Dedicated routing across spot + reserved with a P95 SLA.
A floor of dedicated silicon, in your region, with a named TAM.
Customers
“We cut our inference bill 44% the week we cut over, and our P95 actually dropped. The spot-reclaim handling is the part I can’t replicate in-house.”
“The per-token cost attribution settled a six-month argument with finance in an afternoon. I can finally show which customer is expensive and why.”
“We deploy a new fine-tune every Tuesday. One push, the router warms a pool, and the old version drains with zero dropped requests. It just works.”
FAQ
If yours isn’t here, the docs go deeper — or page an engineer in the shared Slack.
Reclaim notices arrive ~120s ahead. The router stops admitting new requests to that pool and migrates KV state so in-flight decodes finish on the warm reserved floor. Across 412 reclaims yesterday, zero requests failed.
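The drain sequence can be pictured with a toy model. Everything here is hypothetical (the `DrainablePool` class, `handle_reclaim`, and the use of a plain list move to stand in for KV-state migration); it only shows the order of operations described above.

```python
from dataclasses import dataclass, field


@dataclass
class DrainablePool:
    """Hypothetical pool: an admission flag plus the ids of in-flight requests."""
    name: str
    admitting: bool = True
    in_flight: list[str] = field(default_factory=list)


def handle_reclaim(spot: DrainablePool, reserved: DrainablePool) -> list[str]:
    """React to a reclaim notice for `spot`; return the request ids that moved."""
    spot.admitting = False             # 1. stop admitting new requests
    moved = list(spot.in_flight)
    reserved.in_flight.extend(moved)   # 2. stand-in for migrating KV state
    spot.in_flight.clear()             # 3. the spot pool is now empty and safe to lose
    return moved


spot = DrainablePool("spot", in_flight=["r1", "r2"])
reserved = DrainablePool("reserved", in_flight=["r0"])
moved = handle_reclaim(spot, reserved)
```

After the call, the spot pool is closed and empty, and the reserved pool carries `r0`, `r1`, and `r2` to completion.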
Default is fp8 with per-channel scaling. We replay your eval set against fp8, awq-int4, and gptq-int8 and deploy the format that holds within 0.4 points of bf16. You can pin a format if you’d rather decide.
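The selection rule reads naturally as a filter followed by a max. The sketch below assumes "within 0.4 points" means an accuracy drop of at most 0.4 relative to bf16, and the eval numbers are invented for the example; `pick_format` and the result layout are hypothetical.

```python
from typing import Optional


def pick_format(
    results: dict[str, dict[str, float]],
    bf16_accuracy: float,
    tolerance: float = 0.4,
) -> Optional[str]:
    """Return the fastest format whose accuracy drop vs. bf16 is within tolerance."""
    eligible = {
        name: r for name, r in results.items()
        if bf16_accuracy - r["accuracy"] <= tolerance
    }
    if not eligible:
        return None  # nothing qualifies; what happens next is outside this sketch
    return max(eligible, key=lambda name: eligible[name]["tok_s"])


# Made-up eval results, for illustration only.
results = {
    "fp8":       {"accuracy": 70.8, "tok_s": 1900.0},
    "awq-int4":  {"accuracy": 70.1, "tok_s": 2600.0},
    "gptq-int8": {"accuracy": 70.7, "tok_s": 2100.0},
}
chosen_format = pick_format(results, bf16_accuracy=71.0)
```

In this example `awq-int4` is the fastest but loses 0.9 points, so it is excluded; of the two remaining formats, `gptq-int8` has the higher tok/s and is chosen.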
Yes. Push a safetensors checkpoint or point us at a private HF repo, and optionally supply an OCI image with your engine. We validate it against the router contract, then schedule it like any first-party model.
We bill prefill and decode tokens separately at the GPU-class rate, summed per request and rounded to the microcent. Idle warm-pool time on Scale is included; you only pay for tokens that move.
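To make the arithmetic concrete, here is a minimal sketch of a per-request charge. The per-token rates are invented, and the half-up rounding mode is an assumption (the text above only says "rounded"); one microcent is taken as 1e-8 USD.

```python
from decimal import Decimal, ROUND_HALF_UP

MICROCENT = Decimal("0.00000001")  # 1 microcent = 1e-8 USD


def request_cost_usd(
    prefill_tokens: int,
    decode_tokens: int,
    prefill_rate: Decimal,  # hypothetical USD per prefill token for the GPU class
    decode_rate: Decimal,   # hypothetical USD per decode token for the GPU class
) -> Decimal:
    """Sum prefill and decode charges for one request, rounded to the microcent."""
    raw = prefill_tokens * prefill_rate + decode_tokens * decode_rate
    return raw.quantize(MICROCENT, rounding=ROUND_HALF_UP)


cost = request_cost_usd(
    prefill_tokens=1200,
    decode_tokens=300,
    prefill_rate=Decimal("0.000000123"),
    decode_rate=Decimal("0.000000457"),
)
```

With these rates the prefill charge is 0.0001476 USD and the decode charge is 0.0001371 USD, so `cost` comes to 0.00028470 USD. Using `Decimal` rather than floats keeps the microcent rounding exact.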
Pin a model to a region set (e.g. eu-central only) and the scheduler never places it elsewhere, even under load. Traces and prompts are stored in-region; weights live inside the VPC you nominate.
No commit, no fleet to provision. Push a checkpoint, get a streaming endpoint, watch the traces light up.