Hardware-aware kernel design: CUDA, CUTLASS, Triton, TVM
"Eventually every great optimization in machine learning becomes a custom kernel. The question is just who writes it and what tools they use."
In Chapters 25 (FlashAttention) and 26 (quantization) we saw the impact of custom GPU kernels: a single well-written kernel can be 2-4× faster than the naive composition of standard operations. The question this chapter addresses: how do those kernels get written, and what tools do you reach for when you need to optimize at the kernel level?
This is a chapter about the layer below the model. By the end you will understand the kernel-writing ecosystem (CUDA, CUTLASS, Triton, TVM, and a handful of other approaches), what each is good at, when you’d write your own kernel vs use one off the shelf, and why this layer matters for serving. You will not become a kernel author from this chapter — that’s a multi-month skill — but you’ll understand the landscape well enough to make decisions about it.
Outline:
- The kernel layer in context.
- Raw CUDA — the lowest level.
- CUTLASS — NVIDIA’s template library.
- Triton — the Python DSL.
- TVM and other compilers.
- PyTorch’s torch.compile.
- The kernel ecosystems: vLLM, SGLang, TensorRT-LLM, llama.cpp.
- When you need a custom kernel.
- The state of the art in 2025.
38.1 The kernel layer in context
A modern LLM inference stack has several layers: the model definition at the top, the framework (PyTorch) beneath it, the kernel layer below that, and the GPU hardware at the bottom.
The kernel layer is where the actual GPU work happens. A “kernel” is a function that runs on the GPU’s SMs (Streaming Multiprocessors), processing tensors stored in HBM. Most ML operations — matmul, convolution, softmax, layer norm, attention — eventually call into one or more kernels.
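A quick way to see why kernels are judged against memory bandwidth: for an elementwise operation, the floor on runtime is simply bytes moved divided by HBM bandwidth. A back-of-envelope sketch (the bandwidth figure is an assumed, roughly H100-class number, not a spec quote):

```python
# Back-of-envelope latency floor for a memory-bound elementwise kernel.
# Assumed hardware number (illustrative, roughly H100-class HBM):
HBM_BANDWIDTH = 3.35e12   # bytes/s
n = 1 << 28               # ~268M float32 elements

# c = a + b reads two arrays and writes one: 3 * 4 bytes per element.
bytes_moved = 3 * 4 * n
time_s = bytes_moved / HBM_BANDWIDTH
print(f"{bytes_moved / 1e9:.1f} GB moved, ~{time_s * 1e3:.2f} ms at peak bandwidth")
```

No amount of compute optimization beats this floor; it is why fusing operations (moving the data once instead of three times) is the dominant optimization for non-matmul kernels.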
Most of the time, you don’t write kernels yourself. You use the ones that ship with PyTorch, cuBLAS, FlashAttention, or vLLM. They’re highly optimized and cover the common cases. The reason to dive into kernel writing is when:
- The standard kernels don’t cover your specific operation.
- The standard kernels exist but are sub-optimal for your specific shapes.
- You’re targeting a new hardware feature (e.g., Hopper TMA, Blackwell features) that doesn’t have library support yet.
- You want to experiment with a research idea before it gets upstreamed.
For most practitioners, this is a small fraction of work. For frontier labs, it’s most of the work — the difference between a “good enough” model serving setup and a state-of-the-art one is often a handful of custom kernels.
38.2 Raw CUDA
The lowest level is raw CUDA C++. You write kernels in a C++ dialect that compiles to PTX (NVIDIA’s intermediate language) and then to SASS (the actual GPU assembly).
A CUDA kernel looks like:
```cuda
__global__ void add_vectors(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
That’s a vector addition kernel. The __global__ annotation marks it as runnable on the GPU. blockIdx, blockDim, threadIdx are the parallelism primitives — the kernel runs in parallel across many threads, organized into blocks.
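The index arithmetic, and why the kernel needs its bounds check, can be sketched in plain Python (illustrative only; `launch_indices` is a made-up helper, not a CUDA API):

```python
# Each CUDA thread computes one global index: idx = blockIdx * blockDim + threadIdx.
# The host rounds the grid size UP, so the last block can overshoot the array,
# which is exactly why the kernel guards with `if (idx < n)`.
def launch_indices(n, block_dim=256):
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
    covered = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            idx = block_idx * block_dim + thread_idx
            if idx < n:                  # the bounds check from the kernel
                covered.append(idx)
    return grid_dim, covered

grid_dim, covered = launch_indices(1000)
print(grid_dim)                        # 4 blocks of 256 threads cover 1000 elements
print(covered == list(range(1000)))    # True: every element hit exactly once
```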
Raw CUDA gives you the most control. You can:
- Manage shared memory (the SRAM) explicitly.
- Use warp-level primitives for fine-grained synchronization.
- Coalesce memory accesses for maximum bandwidth.
- Hand-tune for specific GPU architectures.
The cost is that it’s hard. A correct CUDA kernel for matmul (the kind that competes with cuBLAS) is hundreds of lines of dense C++ with intricate memory layout and synchronization. The standard reference is the CUDA samples in NVIDIA’s documentation; the real reference is the CUTLASS source code, which is what you write CUDA C++ “in the style of” if you want production-quality performance.
For most ML practitioners, don’t write raw CUDA. The barrier is too high and the alternatives (Triton, CUTLASS templates) get you most of the way at much lower cost. Raw CUDA is for kernel specialists at the frontier labs, kernel maintainers at NVIDIA, and a handful of others.
38.3 CUTLASS — NVIDIA’s template library
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA’s open-source template library for high-performance matmul and related operations. The pitch: CUDA kernels for matmul are very repetitive (tile loop, load to SRAM, compute, write back), and the best implementation depends on the matrix shapes, dtypes, and target GPU. CUTLASS provides C++ templates that generate high-performance matmul kernels for any combination of these.
You don’t write the kernel from scratch; you specify the configuration (tile sizes, dtype, layout, etc.) and CUTLASS generates the kernel for you. The result is often within a few percent of cuBLAS performance, but with more flexibility — you can compose CUTLASS into custom operations that cuBLAS doesn’t support.
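That tile-loop structure can be sketched in plain Python. This is purely illustrative: the real templates add shared-memory staging, Tensor Core fragments, and software pipelining, but the loop nest has the same shape:

```python
# Plain-Python sketch of the tiled matmul structure CUTLASS generates:
# loop over output tiles, loop over K tiles (the "load to SRAM" step),
# accumulate into a tile-local buffer, then write the tile back.
def tiled_matmul(A, B, M, N, K, TM=2, TN=2, TK=2):
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, TM):               # output tile rows
        for n0 in range(0, N, TN):           # output tile cols
            acc = [[0.0] * TN for _ in range(TM)]   # "registers" for this tile
            for k0 in range(0, K, TK):       # K-dimension tile loop
                for m in range(m0, min(m0 + TM, M)):
                    for n in range(n0, min(n0 + TN, N)):
                        for k in range(k0, min(k0 + TK, K)):
                            acc[m - m0][n - n0] += A[m][k] * B[k][n]
            for m in range(m0, min(m0 + TM, M)):    # write tile back
                for n in range(n0, min(n0 + TN, N)):
                    C[m][n] = acc[m - m0][n - n0]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_matmul(A, B, 2, 2, 2))   # [[19.0, 22.0], [43.0, 50.0]]
```

The template parameters you hand to CUTLASS (tile sizes TM/TN/TK, dtypes, layouts) are exactly the knobs this sketch exposes, which is why the same template can be tuned for very different shapes and GPUs.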
CUTLASS is the backbone of many high-performance kernels:
- The FlashAttention 2/3 kernels are built on CUTLASS.
- TensorRT-LLM uses CUTLASS extensively.
- vLLM uses CUTLASS templates for several of its quantized matmul kernels.
The complexity is real. CUTLASS requires you to understand C++ templates, GPU memory hierarchy, warp-level operations, and Tensor Core programming. It’s much harder than Python but much more powerful than off-the-shelf libraries.
If you’re writing production-grade LLM kernels, CUTLASS is the toolbox. If you’re writing experimental research kernels, Triton (next section) is usually easier.
38.4 Triton — the Python DSL
Triton (Tillet et al., 2019; later developed at OpenAI) is a Python DSL for writing GPU kernels. It’s the closest thing to a “kernel language for the rest of us” — you write Python that looks vaguely like NumPy, and the Triton compiler generates GPU code.
A Triton kernel for vector addition:
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
```
That’s a complete Triton kernel for vector addition. Comparable to the CUDA version above but in Python and substantially shorter.
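To build intuition for what each program instance (`pid`) computes, here is a plain-Python emulation of the blocked load/compute/store pattern above (no GPU or Triton needed; purely illustrative):

```python
# Plain-Python emulation of what each Triton "program" (pid) does in a
# blocked vector add: compute a block of offsets, mask out-of-range lanes,
# load, add, store. Real Triton runs all pids in parallel on the GPU.
def add_emulated(x, y, BLOCK_SIZE=4):
    n = len(x)
    out = [0.0] * n
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE   # the launch grid
    for pid in range(num_programs):
        block_start = pid * BLOCK_SIZE
        offsets = [block_start + i for i in range(BLOCK_SIZE)]
        for off in offsets:
            if off < n:                # mask = offsets < n_elements
                out[off] = x[off] + y[off]
    return out

print(add_emulated([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
# [11.0, 22.0, 33.0, 44.0, 55.0]
```

Note how the mask plays the same role as the `if (idx < n)` guard in the CUDA version: the last block overshoots the array and the out-of-range lanes are simply suppressed.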
The Triton compiler handles:
- Memory coalescing (it generates efficient access patterns).
- SRAM usage (it picks tile sizes and manages the shared memory).
- Warp-level operations (you don’t think about warps directly).
- Parallelism (you specify a grid; Triton handles thread mapping).
The result is kernels that are typically within 80-95% of hand-tuned CUDA performance, written in 1/5 the lines of code.
Triton is the dominant DSL for ML kernel research as of late 2025. The major libraries that use it:
- FlashAttention has Triton implementations alongside the CUDA originals.
- PyTorch’s torch.compile uses Triton as one of its backends.
- vLLM has Triton implementations for some operations.
- liger-kernel is a Triton-based library of optimized kernels for transformer training and inference.
- xFormers has Triton kernels for various attention variants.
If you want to write a custom kernel today, start with Triton. Drop down to CUTLASS or raw CUDA only if you need the absolute last 10% of performance or you’re hitting Triton’s limitations.
```mermaid
graph TD
    Start[Need a custom kernel?] --> Q1{Does a standard kernel exist?}
    Q1 -->|Yes, use it| End1[cuBLAS / FlashAttention / vLLM built-ins]
    Q1 -->|No or wrong shape| Q2{Research prototype or production?}
    Q2 -->|Prototype| Triton[Write in Triton — Python DSL]
    Q2 -->|Production, need last 10%| Q3{NVIDIA GPU matmul?}
    Q3 -->|Yes| CUTLASS[Use CUTLASS templates]
    Q3 -->|Other op / need full control| CUDA[Raw CUDA C++]
    style Triton fill:var(--fig-accent-soft),stroke:var(--fig-accent)
    style End1 fill:var(--fig-surface),stroke:var(--fig-border)
```
The kernel-choice decision tree: start at the top and stop as early as possible — raw CUDA is rarely the right answer.
38.5 TVM and other compilers
TVM (Tianqi Chen et al., 2018) is an end-to-end deep learning compiler. The pitch: instead of writing kernels by hand, describe the computation declaratively, and let the compiler search over implementation choices to find the fastest one for your specific hardware.
The TVM workflow:
- Define the computation in TVM’s IR (intermediate representation).
- Define a “schedule” — how the computation should be parallelized, tiled, vectorized.
- Search over schedules using auto-tuning (AutoTVM, Ansor, MetaSchedule).
- Generate code for the target hardware.
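The search step can be sketched as a toy: enumerate candidate schedules (here just square tile sizes), score each with a cost function, keep the cheapest. The cost function below is a made-up stand-in; real auto-tuners (AutoTVM, Ansor, MetaSchedule) measure on hardware or use learned cost models:

```python
# Toy sketch of TVM-style auto-tuning: search over schedule candidates
# and keep the one with the lowest modeled cost. Every number here is
# an illustrative assumption, not a real hardware model.
def toy_cost(tile, matrix_dim=1024, sram_floats=4096):
    if tile * tile > sram_floats:        # tile doesn't fit in "SRAM": invalid
        return float("inf")
    loads = matrix_dim / tile            # fewer, larger tiles -> fewer loads
    underuse = max(0, 32 - tile)         # tiny tiles underuse the hardware
    return loads + underuse

candidates = [4, 8, 16, 32, 64, 128]
best = min(candidates, key=toy_cost)
print(best)   # the largest tile that still fits the modeled SRAM budget
```

The real systems explore a vastly larger space (loop orders, vectorization, thread bindings), but the shape of the problem is the same: a discrete search guided by a cost signal.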
TVM was the leading approach in 2019-2021 and has been used in production at large scale. Its strength is portability across hardware — the same TVM-defined computation can target NVIDIA GPUs, AMD GPUs, ARM CPUs, mobile devices, etc.
The weakness is that TVM has been somewhat overtaken by simpler alternatives:
- Triton is easier to use for new kernel writing.
- MLIR/IREE is the modern compiler infrastructure that Google bets on.
- PyTorch’s torch.compile uses TorchInductor (which generates Triton) instead of TVM.
TVM is still actively developed and used in some production settings, especially where multi-hardware portability matters. But for pure NVIDIA serving, Triton + CUTLASS has won the mindshare.
Other compilers worth knowing:
- MLIR — the LLVM-based ML compiler infrastructure.
- IREE — Google’s MLIR-based runtime.
- TensorRT — NVIDIA’s inference compiler. Optimizes models for NVIDIA hardware specifically.
- JAX/XLA — Google’s compiler that powers JAX and TPU code.
Each has its niche. For LLM inference on NVIDIA GPUs in 2025, the dominant stack is PyTorch + Triton-based kernels + CUTLASS for the heaviest matmul work.
38.6 PyTorch’s torch.compile
torch.compile is PyTorch’s just-in-time compiler, introduced in PyTorch 2.0. The idea: take an arbitrary PyTorch model and compile it into a faster version automatically, without the user writing any kernels.
The pipeline:
- TorchDynamo captures the Python code into an FX graph.
- AOTAutograd handles the autograd partitioning.
- TorchInductor is the backend code generator. For GPU, it generates Triton code; for CPU, it generates C++.
- The compiled kernels are cached for reuse.
The end result: most PyTorch models get a 1.5-2× speedup with one line of code: model = torch.compile(model).
For LLM inference, torch.compile is useful for:
- Quick experiments where you don’t want to switch to vLLM yet.
- Fine-tuning loops where you want to speed up the forward and backward passes.
- Custom model architectures that vLLM doesn’t support.
It’s not as fast as vLLM’s hand-optimized stack for production serving, but it’s a much lower bar to entry. If you’re prototyping a new architecture, torch.compile gets you “fast enough” without writing any kernels.
38.7 The kernel ecosystems
The major LLM serving stacks each have their own kernel approach. Understanding the landscape:
vLLM
vLLM uses a mix:
- Custom CUDA/CUTLASS kernels for the hottest paths (paged attention, INT4 Marlin matmul).
- FlashAttention for the standard attention.
- Triton for some specialized kernels (e.g., custom samplers).
- PyTorch for everything else.
vLLM’s kernel team is small but produces high-quality work. The Marlin INT4 kernel and the paged attention kernel are both world-class.
SGLang
Similar to vLLM:
- Heavy use of FlashAttention.
- Custom kernels for RadixAttention (the prefix cache, Chapter 29).
- Triton kernels for various optimizations.
- Recent work on FP8 attention kernels for Hopper.
TensorRT-LLM
NVIDIA’s official LLM serving framework. Uses NVIDIA’s full stack:
- CUTLASS for matmul.
- Custom CUDA for attention and other operations.
- TensorRT compiler for optimization passes.
- CUDA Graphs for low-overhead launches.
TensorRT-LLM is often the fastest for NVIDIA hardware because NVIDIA’s kernel team has the most resources. The cost is operational complexity and being locked into NVIDIA’s stack.
llama.cpp
A different lineage. llama.cpp is a from-scratch C++ inference engine, originally for CPUs but now with GPU support via CUDA, Metal, and Vulkan. It uses:
- Custom CUDA kernels written in straightforward C++.
- Hand-written quantization formats (the GGUF family).
- No framework dependencies — everything is built from scratch.
llama.cpp’s kernels are often slower than vLLM’s at the same precision, but the operational simplicity is huge. It’s the dominant choice for consumer/local LLM serving.
MLC-LLM
A TVM-based serving framework. Uses TVM’s compilation approach to target many platforms (NVIDIA GPUs, AMD GPUs, mobile, web). Slower than vLLM on NVIDIA but more portable.
The takeaway: the kernel layer is where the performance differences between serving stacks live. vLLM and TensorRT-LLM compete on kernel performance. SGLang competes on prefix caching kernel performance. llama.cpp competes on simplicity. MLC-LLM competes on portability.
38.8 When you need a custom kernel
The honest answer: you almost never do. The standard kernels (FlashAttention, cuBLAS, vLLM’s paged attention) cover the vast majority of cases at high performance. Writing your own kernel is a multi-week project that often results in something slower than the standard option.
The cases where you actually need a custom kernel:
(1) A new architecture variant. If you’ve invented a new attention variant or a new normalization that doesn’t have library support, you have to write the kernel yourself. This is the bulk of new-kernel work — research that needs production-quality kernels.
(2) An operation with specific shape constraints. If your model has unusual matrix dimensions that aren’t well-supported by cuBLAS, a custom kernel can outperform.
(3) Targeting new hardware. When Hopper added TMA and WGMMA, the existing kernels didn’t use them. New kernels had to be written to take advantage. Same when Blackwell launches — the early adopters write new kernels.
(4) Specialized fusion. Sometimes you have an operation chain like “RMSNorm → linear → silu → mul → linear” that, when fused into one kernel, is faster than calling each separately. Liger-kernel and similar libraries do this kind of fusion for transformer training.
(5) Quantization formats. New quantization schemes (NF4, MXFP4, custom group sizes) usually need custom kernels because the standard libraries don’t support them.
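Point (4)’s op chain can be written out as a plain-Python reference; this is one reading of that chain as an RMSNorm + SwiGLU-style MLP block, and a fused kernel must reproduce exactly this arithmetic while reading and writing HBM far fewer times (the weights here are tiny made-up examples):

```python
import math

# Reference math for the chain "RMSNorm -> linear -> silu -> mul -> linear",
# read as an RMSNorm + SwiGLU MLP block. Unfused, this is five kernels and
# five round-trips through HBM; fused, the intermediates stay on-chip.
def rmsnorm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W):                       # W: out_dim rows of in_dim weights
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(x):
    return [v / (1.0 + math.exp(-v)) for v in x]

def swiglu_mlp(x, W_gate, W_up, W_down):
    h = rmsnorm(x)
    gate = silu(linear(h, W_gate))      # the silu branch
    up = linear(h, W_up)                # the mul branch
    return linear([g * u for g, u in zip(gate, up)], W_down)

x = [1.0, -2.0]
W_gate = [[0.5, 0.0], [0.0, 0.5]]       # made-up 2x2 example weights
W_up = [[1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 1.0]]
print(swiglu_mlp(x, W_gate, W_up, W_down))
```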
For everything else, use the existing kernels. The wins from custom kernels are typically 10-30% — real but not enormous, and the engineering cost is high.
38.9 The state of the art in 2025
Where the kernel ecosystem is in late 2025:
Triton is the dominant DSL for ML kernel research. Most new optimization papers ship Triton implementations. The Triton community is large and active.
FlashAttention 3 is the canonical attention kernel for Hopper. It uses TMA, WGMMA, and async-matmul. There’s no faster open implementation; competing kernels (e.g., from TensorRT-LLM) are typically equal or marginally slower.
Marlin (INT4 GEMM) is the canonical INT4 matmul kernel for vLLM. Outperforms naive INT4 implementations significantly.
FlashInfer is a newer library specifically for LLM inference operations (paged attention, prefill, sampling). Used by SGLang and increasingly by other serving stacks.
Liger-kernel is the leading Triton-based library for transformer training kernel fusion. Used in production at Meta and elsewhere.
torch.compile has matured significantly and now produces competitive kernels for many operations. It’s the default for prototyping.
The frontier work is:
- Hopper and Blackwell exploitation — using new hardware features like FP8 Tensor Cores, structured sparsity, TMA, and asynchronous distributed shared memory (DSMEM).
- Multi-GPU kernel design — kernels that span multiple GPUs via NVLink without software overhead.
- Disaggregated PD kernels — efficient KV transfer kernels for the disaggregated setting.
The kernel layer is moving fast. New papers come out monthly. The good news for most practitioners: you don’t have to keep up with all of it — the existing libraries (vLLM, FlashAttention, CUTLASS) absorb the improvements over time, and you get them for free by upgrading.
38.10 The mental model
Eight points to take into Chapter 41:
- The kernel layer is where performance lives. Standard kernels cover most cases; custom kernels close the last 10-30% gap.
- Raw CUDA is the lowest level. Maximum control, maximum effort. For specialists.
- CUTLASS is NVIDIA’s template library. Used in FlashAttention 2/3, TensorRT-LLM, and many production kernels.
- Triton is the Python DSL. Easy enough to learn, fast enough for production. The dominant new-kernel choice.
- TVM and similar compilers offer portability at the cost of complexity. Niche.
- torch.compile is PyTorch’s auto-compiler. Generates Triton kernels under the hood. Good for prototyping.
- vLLM, SGLang, TensorRT-LLM, llama.cpp each have their own kernel ecosystem. The performance differences come from the kernels, not the framework around them.
- You almost never need to write kernels. Use the existing libraries. Write custom kernels only for new architectures or new hardware features.
In Chapter 41 we look at the alternative to attention entirely: state-space models and Mamba.
Read it yourself
- The CUDA programming guide. Read sections on memory hierarchy and warp-level operations.
- The CUTLASS GitHub repository. Read the README and the matmul tutorial.
- Tillet et al., Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (2019). The Triton paper.
- The Triton tutorials at triton-lang.org.
- The FlashAttention CUDA kernel source. Hard to read but rewarding.
- The PyTorch torch.compile blog posts and documentation.
- The Liger-kernel GitHub repository for fused kernel examples.
Practice
- Write a Triton kernel for elementwise addition of two vectors. Run it and verify against PyTorch.
- Read the FlashAttention v2 Triton implementation. Identify the tile loop, the online softmax, and the output accumulation.
- Why is Triton easier than raw CUDA for ML kernel writing? List three specific things Triton handles automatically.
- When would you reach for CUTLASS over Triton for a new kernel? When would you reach for raw CUDA over CUTLASS?
- Read the vLLM Marlin INT4 kernel source. How does it use Tensor Cores for INT4 multiplication?
- Why does torch.compile produce slower code than vLLM for LLM inference? Identify two reasons.
- Stretch: implement a simple matmul kernel in Triton. Compare its performance to torch.matmul on (1024, 1024) @ (1024, 1024) matrices.
Concept check
- 1. FlashAttention is faster than a naive attention implementation primarily because it avoids which bottleneck?
- 2. Triton's main advantage over raw CUDA for writing custom LLM kernels is best described as which?
- 3. torch.compile with a custom CUDA kernel inserted as a custom op is used in production vLLM for paged attention. What does torch.compile add that the custom kernel alone does not provide?
- 4. CUTLASS is described as a template library rather than a compiler or a DSL. What is the practical implication of this design for an ML systems engineer writing a new attention variant?