Infrastructure

One stack.
Every accelerator.

GPU scarcity is the bottleneck on AI. Luma's cloud stack turns whatever silicon is available — Trainium, Nvidia, AMD, TPU, spot fleets, idle prior-gen chips — into one reliable training and inference substrate. No porting, no rewriting, no vendor lock-in.

How It Works

From model spec to training job — sealed.

Three steps from a Hugging Face model to a running job on any available accelerator, with no customer porting work and no Luma IP exposed.

Submit your model spec

Point Luma at any Hugging Face checkpoint or customer model. No code changes required. Luma reads the compute graph — your weights and architecture stay in your environment.

Sealed compile in a Nitro Enclave

Luma's compiler runs inside an AWS Nitro Enclave. The output is an opaque training artifact — executable on Trainium, Nvidia, AMD, or TPU — but not inspectable. Luma's kernel IP never leaves the enclave.

Run on available compute

The compiled artifact dispatches to whatever hardware is in your fleet — spot instances, on-demand nodes, heterogeneous mixes. Checkpointing and failover are built in; spot reclamation doesn't stop the job.

Compute Capacity

Run on AWS, in your own account.

On-demand GPU capacity provisioned in your AWS account. You pay AWS directly — we provide the software layer that makes every accelerator work.

Prices shown are representative on-demand rates. Actual cost depends on your AWS region, commitment level, and instance type.

GPU	Price	Performance	Best for
B200 Blackwell · 192 GB HBM3e	$5.73 / GPU-hr	~5% slower than native via Luma compatibility layer	Large-scale pre-training	Available
H100 SXM5 Hopper · 80 GB HBM3	$3.40 / GPU-hr	~7% slower than native via Luma compatibility layer	Fine-tuning, RLHF	Most popular
A100 80 GB Ampere · 80 GB HBM2e	$3.17 / GPU-hr	~4% slower than native via Luma compatibility layer	Research experiments	Available
A100 40 GB Ampere · 40 GB HBM2e	$2.29 / GPU-hr	~8% slower than native via Luma compatibility layer	Small-to-mid scale training	Available
L40S Ada Lovelace · 48 GB GDDR6	$1.46 / GPU-hr	~6% slower than native via Luma compatibility layer	Inference, cost-optimized training	Available

Architecture

What's under the hood

Four layers that together turn heterogeneous, fragmented compute into one dependable training substrate.

Sealed Compiler

Nitro Enclave · NKI kernels · NEFF

Luma's compiler runs inside an AWS Nitro Enclave. It authors custom kernels in NKI (Neuron Kernel Interface), lowers them through neuronx-cc to NEFF, and emits an opaque training artifact — executable but not inspectable. Unsupported Neuron ops are decomposed rather than falling back to CPU.

Fleet Orchestration

Heterogeneous cluster as one logical job

Luma's orchestration layer binds mixed accelerator types — H100, A100, L4, Trainium, TPU — across availability zones or regions into a single training job. The mechanism is a trade secret. From the customer's side: one job spec, whatever hardware is available.

Spot & Failover

Checkpointing · node swap · zero lost work

When a spot instance is reclaimed, Luma migrates the job off the lost node, restores from the latest checkpoint, and resumes on replacement capacity — automatically. Prior-generation accelerators that would otherwise sit idle become viable training nodes.

Inference Layer

vLLM · Optimum Neuron · OpenAI-compatible API

Trained checkpoints are served via vLLM or Optimum Neuron on Trainium and Nvidia hardware. Luma benchmarks both backends — Qwen3-4B reaches ~28 tok/s on Optimum Neuron vs ~21 tok/s on vLLM on a single trn1.2xlarge — so you can choose the right stack for your latency and cost targets.

One stack.Every accelerator.