Infrastructure
GPU scarcity is the bottleneck on AI. Luma's cloud stack turns whatever silicon is available — Trainium, Nvidia, AMD, TPU, spot fleets, idle prior-gen chips — into one reliable training and inference substrate. No porting, no rewriting, no vendor lock-in.
How It Works
Three steps from a Hugging Face model to a running job on any available accelerator, with no customer porting work and no Luma IP exposed.
1. Point Luma at any Hugging Face checkpoint or customer model. No code changes are required. Luma reads the compute graph; your weights and architecture stay in your environment.
2. Luma's compiler runs inside an AWS Nitro Enclave. The output is an opaque training artifact: executable on Trainium, Nvidia, AMD, or TPU, but not inspectable. Luma's kernel IP never leaves the enclave.
3. The compiled artifact dispatches to whatever hardware is in your fleet: spot instances, on-demand nodes, or heterogeneous mixes. Checkpointing and failover are built in, so spot reclamation doesn't stop the job.
Compute Capacity
On-demand GPU capacity provisioned in your AWS account. You pay AWS directly — we provide the software layer that makes every accelerator work.
| GPU | Price | Performance | Best for | Status |
|---|---|---|---|---|
| B200 (Blackwell · 192 GB HBM3e) | $5.73 / GPU-hr | ~5% slower than native via Luma compatibility layer | Large-scale pre-training | Available |
| H100 SXM5 (Hopper · 80 GB HBM3) | $3.40 / GPU-hr | ~7% slower than native via Luma compatibility layer | Fine-tuning, RLHF | Most popular |
| A100 80 GB (Ampere · 80 GB HBM2e) | $3.17 / GPU-hr | ~4% slower than native via Luma compatibility layer | Research experiments | Available |
| A100 40 GB (Ampere · 40 GB HBM2) | $2.29 / GPU-hr | ~8% slower than native via Luma compatibility layer | Small-to-mid scale training | Available |
| L40S (Ada Lovelace · 48 GB GDDR6) | $1.46 / GPU-hr | ~6% slower than native via Luma compatibility layer | Inference, cost-optimized training | Available |
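To budget a job from the prices and overheads listed above, a rough estimate can be sketched as follows. This is illustrative arithmetic only, not a Luma billing tool: it assumes "~N% slower than native" means the job takes N% more wall-clock time, and the function name `estimated_cost` and the example job size are hypothetical.

```python
# Price (USD per GPU-hour) and approximate slowdown versus native,
# copied from the Compute Capacity table.
GPUS = {
    "B200": (5.73, 0.05),
    "H100 SXM5": (3.40, 0.07),
    "A100 80 GB": (3.17, 0.04),
    "A100 40 GB": (2.29, 0.08),
    "L40S": (1.46, 0.06),
}


def estimated_cost(gpu: str, native_gpu_hours: float) -> float:
    """Estimated USD cost of a job on `gpu`.

    `native_gpu_hours` is the GPU-hours the job would need on that same
    GPU with native software. Assumption: "~N% slower" is read as N% more
    wall-clock time, so billed hours = native hours * (1 + N/100).
    """
    price, slowdown = GPUS[gpu]
    return price * native_gpu_hours * (1.0 + slowdown)


if __name__ == "__main__":
    # Hypothetical job: 8 H100s for 24 hours = 192 native GPU-hours.
    print(f"${estimated_cost('H100 SXM5', 8 * 24):,.2f}")
```

GPU-hours are not comparable across GPU types, so the estimate is only meaningful for one row at a time.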
Architecture
Four layers that together turn heterogeneous, fragmented compute into one dependable training substrate.
Luma's compiler runs inside an AWS Nitro Enclave. It authors custom kernels in NKI (Neuron Kernel Interface), lowers them through neuronx-cc to NEFF (Neuron Executable File Format), and emits an opaque training artifact that is executable but not inspectable. Unsupported Neuron ops are decomposed into supported operations rather than falling back to CPU.
Luma's orchestration layer binds mixed accelerator types — H100, A100, L4, Trainium, TPU — across availability zones or regions into a single training job. The mechanism is a trade secret. From the customer's side: one job spec, whatever hardware is available.
When a spot instance is reclaimed, Luma migrates the job off the lost node, restores from the latest checkpoint, and resumes on replacement capacity — automatically. Prior-generation accelerators that would otherwise sit idle become viable training nodes.
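The recovery sequence described above (lose a node, restore from the latest checkpoint, resume) follows a general checkpoint-and-resume pattern. The sketch below simulates that pattern in plain Python. It is not Luma's implementation, which is not public; every name in it (`SpotReclaimed`, `train`, `run_with_failover`) is hypothetical, and the "training step" is a stand-in.

```python
import json
import os
import tempfile


class SpotReclaimed(Exception):
    """Raised by the simulated node when its spot instance is reclaimed."""


def save_checkpoint(path: str, step: int, state: dict) -> None:
    # Write to a temp file and rename, so a reclaim in the middle of a
    # write never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> tuple[int, dict]:
    # No checkpoint yet means a fresh start at step 0.
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]


def train(path, total_steps, checkpoint_every, reclaim_at=None):
    """Train from the latest checkpoint; optionally simulate one reclaim."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        if reclaim_at is not None and step == reclaim_at:
            raise SpotReclaimed(f"node lost at step {step}")
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    save_checkpoint(path, step, state)  # final state
    return step, state


def run_with_failover(path, total_steps, checkpoint_every, reclaim_at):
    """Restart after a simulated reclaim; returns the number of restarts."""
    restarts = 0
    while True:
        try:
            train(path, total_steps, checkpoint_every, reclaim_at)
            return restarts
        except SpotReclaimed:
            restarts += 1
            reclaim_at = None  # the replacement node survives in this demo


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print("restarts:", run_with_failover(ckpt, 100, 10, reclaim_at=37))
    print("final step:", load_checkpoint(ckpt)[0])
```

In this simulation the work done between the last checkpoint and the reclaim is repeated after the restart, which is the usual trade-off when choosing a checkpoint interval.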
Trained checkpoints are served via vLLM or Optimum Neuron on Trainium and Nvidia hardware. Luma benchmarks both backends so you can choose the right stack for your latency and cost targets: on a single trn1.2xlarge, Qwen3-4B reaches ~28 tok/s on Optimum Neuron versus ~21 tok/s on vLLM.
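To turn the quoted throughput figures into latency and cost terms, a back-of-the-envelope calculation might look like the sketch below. It assumes the tok/s figures describe steady-state generation and ignores prompt processing. The instance price is left as a parameter because it is not stated here, and the function names are hypothetical.

```python
# Approximate throughput quoted above for Qwen3-4B on one trn1.2xlarge.
TOK_PER_S = {"optimum-neuron": 28.0, "vllm": 21.0}


def seconds_to_generate(backend: str, n_tokens: int) -> float:
    """Time to generate `n_tokens` at the quoted steady-state rate."""
    return n_tokens / TOK_PER_S[backend]


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two backends."""
    return TOK_PER_S[fast] / TOK_PER_S[slow]


def usd_per_million_tokens(backend: str, instance_usd_per_hour: float) -> float:
    """Cost per million generated tokens for a given hourly instance price.

    `instance_usd_per_hour` is supplied by the caller; no price is assumed.
    """
    tokens_per_hour = TOK_PER_S[backend] * 3600
    return instance_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    print(f"speedup: {speedup('optimum-neuron', 'vllm'):.2f}x")
    for backend in TOK_PER_S:
        print(f"{backend}: {seconds_to_generate(backend, 512):.1f} s for 512 tokens")
```

On these figures Optimum Neuron delivers about a third more tokens per second than vLLM on the same instance, so its cost per token is about 25% lower at any given hourly price.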
Documentation
Full documentation, quickstart guides, and API reference are in progress. Join the waitlist for early access.