Today, Anthropic open-sourced their original performance engineering take-home. The task: optimize a kernel running on a custom VLIW SIMD processor simulator. The baseline takes 147,734 cycles. Claude Opus 4.5 got it down to 1,487 cycles - a 99x speedup that beat most humans.

I’m Tristan (@trirpi), and I work on AI kernels. Let’s break down how this whole system works.


The Architecture at a Glance

This is a VLIW (Very Long Instruction Word) SIMD (Single Instruction Multiple Data) processor with a single core (older versions of the take-home had multiple cores). Let me break down what that means.

VLIW: Compiler-Scheduled Parallelism

In a traditional processor, hardware figures out at runtime which instructions can run in parallel. In a VLIW processor, that job shifts to the compiler (or in this case, you).

The single core has multiple functional units that can all execute simultaneously:

Unit    Count   Operations
ALU     12      Scalar: +, -, *, //, ^, &, |, <<, >>, %, <, ==
VALU    6       Vector (8 elements): same ops as ALU, plus vbroadcast and multiply_add
LOAD    2       load, vload (8 words), const
STORE   2       store, vstore (8 words)
FLOW    1       select, vselect, jump, cond_jump, halt

You pack operations into instruction bundles. Each cycle, the processor executes one bundle, dispatching operations to all the units in parallel. If you only put one operation in a bundle, the other units sit idle. That’s why the baseline is so slow.

Example bundle (executes in 1 cycle):

{"alu": [op1, op2, op3], "valu": [vop1, vop2], "load": [ld1, ld2]}

With 12 ALUs and 6 VALUs (each processing 8 elements), this single core can theoretically do 12 + 6×8 = 60 arithmetic operations per cycle.

Memory Hierarchy

flowchart LR
    subgraph mem["💾 MAIN MEMORY"]
        DATA["Problem Data
(tree, indices, values)"]
    end
    subgraph scratch["📦 SCRATCH SPACE (1536 words)"]
        REG["Works like registers
All ALU ops read/write here"]
    end
    mem <-->|"LOAD/STORE
⚠️ 2 each per cycle"| scratch
  • Main Memory: Where the problem data lives. ALU/VALU can’t access it directly.
  • Scratch Space: 1536 words of fast storage. All compute operations read/write scratch addresses.
  • Bottleneck: Only 2 loads and 2 stores per cycle. This is often the limiting factor, not compute.

The Execution Engines

The processor has multiple engines, each capable of executing multiple slots per cycle. From problem.py:

SLOT_LIMITS = {
    "alu": 12,      # 12 scalar ALU operations per cycle
    "valu": 6,      # 6 vector ALU operations per cycle
    "load": 2,      # 2 load operations per cycle
    "store": 2,     # 2 store operations per cycle
    "flow": 1,      # 1 flow control operation per cycle
    "debug": 64,    # Debug operations (not counted)
}
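
These limits are easy to exceed when hand-packing bundles. Here's a minimal checker - my own helper, not part of the repo - that validates a bundle against SLOT_LIMITS before you feed it to the simulator:

def check_bundle(instr: dict) -> None:
    # Reject bundles that oversubscribe any engine. `instr` maps an
    # engine name to the list of operation tuples for this cycle.
    for engine, ops in instr.items():
        if engine not in SLOT_LIMITS:
            raise ValueError(f"unknown engine: {engine}")
        if len(ops) > SLOT_LIMITS[engine]:
            raise ValueError(f"{engine}: {len(ops)} ops, limit {SLOT_LIMITS[engine]}")

check_bundle({"alu": [("+", 0, 1, 2)] * 12, "load": [("const", 3, 7)]})  # passes
check_bundle({"flow": [("jump", 0), ("halt",)]})  # raises: flow limit is 1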

What an Instruction Bundle Looks Like

flowchart LR
    subgraph bundle["📦 Instruction Bundle (1 clock cycle)"]
        subgraph compute["Compute"]
            ALU["alu:
('+', dest, a, b)
('-', dest, a, b)
('*', dest, a, b)
...up to 12"]
            VALU["valu:
('*', vdest, va, vb)
('+', vdest, va, vb)
...up to 6"]
        end
        subgraph memory["Memory"]
            LOAD["load:
('load', dest, addr)
('vload', vdest, addr)"]
            STORE["store:
('store', addr, src)
('vstore', addr, vsrc)"]
        end
        subgraph control["Control"]
            FLOW["flow:
('select', d, c, a, b)"]
            DEBUG["debug:
('compare', loc, key)
(not counted)"]
        end
    end

An instruction is a Python dict mapping engine names to lists of operations. Here’s a real example:

{"valu": [("*", 4, 0, 0), ("+", 8, 4, 0)], "load": [("load", 16, 17)]}

This executes three operations in one cycle:

  1. Vector multiply: scratch[4:12] = scratch[0:8] * scratch[0:8]
  2. Vector add: scratch[8:16] = scratch[4:12] + scratch[0:8]
  3. Scalar load: scratch[16] = memory[scratch[17]]

The Problem: Batched Tree Traversal

The kernel implements a batched tree traversal with hashing. Here’s the flow:

flowchart LR
    subgraph rounds["🔄 16 Rounds"]
        R0["Round 0"] --> R1["Round 1"] --> R2["Round 2"] --> RN["..."]
    end
    
    subgraph batch["📊 Batch of 256 items"]
        B0["Item 0"]
        B1["Item 1"]
        B2["Item 2"]
        BN["..."]
    end
    
    subgraph ALGO["⚙️ Per-item computation"]
        A1["idx = indices[i]"] --> A2["val = values[i]"]
        A2 --> A3["node_val = tree[idx]"]
        A3 --> A4["val = hash(val ^ node_val)"]
        A4 --> A5["idx = 2*idx + (1 if even else 2)"]
        A5 --> A6["if idx >= n_nodes: idx = 0"]
    end
    
    rounds --> batch
    batch --> ALGO

From the reference kernel:

def reference_kernel(t: Tree, inp: Input):
    """
    A parallel tree traversal where at each node we set
    cur_inp_val = myhash(cur_inp_val ^ node_val)
    and then choose the left branch if cur_inp_val is even.
    If we reach the bottom of the tree we wrap around to the top.
    """
    for h in range(inp.rounds):
        for i in range(len(inp.indices)):
            idx = inp.indices[i]
            val = inp.values[i]
            val = myhash(val ^ t.values[idx])
            idx = 2 * idx + (1 if val % 2 == 0 else 2)
            idx = 0 if idx >= len(t.values) else idx
            inp.values[i] = val
            inp.indices[i] = idx

Test configuration:

  • Tree height: 10 (2047 nodes in a perfect binary tree)
  • Batch size: 256 items processed
  • Rounds: 16 iterations

That’s 256 × 16 = 4096 traversal steps, each involving a hash computation.

The Hash Function

The hash runs 6 stages, each doing a = (a op1 const) op2 (a op3 shift):

Stage   Formula
0       a = (a + 0x7ED55D16) + (a << 12)
1       a = (a ^ 0xC761C23C) ^ (a >> 19)
2       a = (a + 0x165667B1) + (a << 5)
3       a = (a + 0xD3A2646C) ^ (a << 9)
4       a = (a + 0xFD7046C5) + (a << 3)
5       a = (a ^ 0xB55A4F09) ^ (a >> 16)

Each stage = 3 ALU ops. Total: 6 × 3 = 18 ALU operations per hash.
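
Those 18 ops per hash also give a useful back-of-envelope floor on the whole problem. My own estimate, ignoring loads, index updates, and dependency stalls:

steps = 256 * 16             # batch size x rounds = 4096 traversal steps
hash_ops = steps * 18        # 73,728 scalar ops for hashing alone
per_cycle = 12 + 6 * 8       # 12 ALU + 6 VALU x 8 lanes = 60 ops/cycle
print(hash_ops / per_cycle)  # ~1229 cycles

Claude's 1,487 cycles is within roughly 20% of that floor, which tells you how little slack is left.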

The hash is defined in a data-driven way, which makes it straightforward to generate kernel code for it:

HASH_STAGES = [
    ("+", 0x7ED55D16, "+", "<<", 12),
    ("^", 0xC761C23C, "^", ">>", 19),
    ("+", 0x165667B1, "+", "<<", 5),
    ("+", 0xD3A2646C, "^", "<<", 9),
    ("+", 0xFD7046C5, "+", "<<", 3),
    ("^", 0xB55A4F09, "^", ">>", 16),
]

It's similar to Bob Jenkins’ integer hash. Each stage computes a = (a op1 val1) op2 (a op3 val3).
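
For reference, here's a sketch of myhash reconstructed directly from HASH_STAGES, wrapping to 32 bits after every ALU op the way the simulator does (the repo's actual definition may differ in details):

MASK32 = (1 << 32) - 1
OPS = {
    "+": lambda a, b: a + b,
    "^": lambda a, b: a ^ b,
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: a >> b,
}

def myhash(a: int) -> int:
    for op1, val1, op2, op3, val3 in HASH_STAGES:
        t1 = OPS[op1](a, val1) & MASK32  # a op1 val1
        t2 = OPS[op3](a, val3) & MASK32  # a op3 val3
        a = OPS[op2](t1, t2) & MASK32    # combine the two halves
    return a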


The Memory Model

Two memory spaces:

  1. Main Memory (self.mem): Problem input/output
  2. Scratch Space (core.scratch): 1536 words - think of it as registers + constant memory + manually managed cache

From the constants:

VLEN = 8          # Vector length: 8 elements
N_CORES = 1       # Single core (older versions had multiple)
SCRATCH_SIZE = 1536  # 1536 words of scratch space

Every ALU operation reads and writes scratch addresses. It’s like programming a GPU with shared memory, but more explicit.
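
In practice, that means every constant the kernel needs - hash constants, shift amounts, the literal 2 for the index update - must be staged into scratch first, typically with const ops (a load-engine slot) during setup. A sketch, with scratch addresses chosen arbitrarily by me:

setup = [
    {"load": [("const", 0, 2), ("const", 1, 0x7ED55D16)]},   # scratch[0] = 2, scratch[1] = stage-0 constant
    {"load": [("const", 2, 12), ("const", 3, 0xC761C23C)]},  # scratch[2] = 12, scratch[3] = stage-1 constant
]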


The ALU Operations

Scalar ALU ops:

def alu(self, core, op, dest, a1, a2):
    a1 = core.scratch[a1]
    a2 = core.scratch[a2]
    match op:
        case "+":  res = a1 + a2
        case "-":  res = a1 - a2
        case "*":  res = a1 * a2
        case "//": res = a1 // a2
        case "^":  res = a1 ^ a2       # XOR
        case "&":  res = a1 & a2       # AND
        case "|":  res = a1 | a2       # OR
        case "<<": res = a1 << a2      # Left shift
        case ">>": res = a1 >> a2      # Right shift
        case "%":  res = a1 % a2       # Modulo
        case "<":  res = int(a1 < a2)  # Comparison
        case "==": res = int(a1 == a2)
    res = res % (2**32)  # 32-bit unsigned wrap
    self.scratch_write[dest] = res

Vector ALU ops - the SIMD part:

def valu(self, core, *slot):
    match slot:
        case ("vbroadcast", dest, src):
            # Broadcast scalar to all 8 vector lanes
            for i in range(VLEN):
                self.scratch_write[dest + i] = core.scratch[src]
        case ("multiply_add", dest, a, b, c):
            # Fused multiply-add: dest = a * b + c
            for i in range(VLEN):
                mul = (core.scratch[a + i] * core.scratch[b + i]) % (2**32)
                self.scratch_write[dest + i] = (mul + core.scratch[c + i]) % (2**32)
        case (op, dest, a1, a2):
            # Any scalar op applied element-wise
            for i in range(VLEN):
                self.alu(core, op, dest + i, a1 + i, a2 + i)
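
Note multiply_add: it's a perfect fit for this problem's index update idx = 2*idx + offset, covering 8 items in a single slot. A sketch with a hypothetical scratch layout, where TWO holds a broadcast 2 and OFF holds each item's offset (1 or 2):

IDX, TWO, OFF = 64, 72, 80   # hypothetical 8-wide scratch regions
# For each lane i: scratch[IDX+i] = scratch[IDX+i] * scratch[TWO+i] + scratch[OFF+i]
bundle = {"valu": [("multiply_add", IDX, IDX, TWO, OFF)]}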

Memory Operations

Load/store - only 2 of each per cycle:

def load(self, core, *slot):
    match slot:
        case ("load", dest, addr):
            self.scratch_write[dest] = self.mem[core.scratch[addr]]
        case ("vload", dest, addr):  # 8 consecutive elements
            addr = core.scratch[addr]
            for vi in range(VLEN):
                self.scratch_write[dest + vi] = self.mem[addr + vi]
        case ("const", dest, val):
            self.scratch_write[dest] = (val) % (2**32)

Key bottleneck: only 2 loads per cycle. Vector loads (vload) help - 8 elements in one slot!
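
Concretely: fetching 8 consecutive batch values with scalar loads occupies both load slots for four cycles, while a single vload does it in one slot. A sketch with hypothetical scratch addresses:

VALS, PTR = 32, 16   # scratch[PTR] holds the base pointer into main memory
# One slot: scratch[32:40] = mem[scratch[16] : scratch[16] + 8]
bundle = {"load": [("vload", VALS, PTR)]}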


Flow Control

Flow ops - crucial for branchless programming:

def flow(self, core, *slot):
    match slot:
        case ("select", dest, cond, a, b):
            # Branchless: dest = cond ? a : b
            self.scratch_write[dest] = (
                core.scratch[a] if core.scratch[cond] != 0 else core.scratch[b]
            )
        case ("vselect", dest, cond, a, b):
            # Vector version
            for vi in range(VLEN):
                self.scratch_write[dest + vi] = (
                    core.scratch[a + vi]
                    if core.scratch[cond + vi] != 0
                    else core.scratch[b + vi]
                )
        case ("cond_jump", cond, addr):
            if core.scratch[cond] != 0:
                core.pc = addr
        case ("jump", addr):
            core.pc = addr

Why the Baseline Is So Slow

The baseline kernel deliberately uses one operation per cycle:

def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
    # Simple slot packing that just uses one slot per instruction bundle
    instrs = []
    for engine, slot in slots:
        instrs.append({engine: [slot]})  # One op per bundle!
    return instrs

So instead of:

{"alu": [op1, op2, op3], "load": [load1]}  # 1 cycle

You get:

{"alu": [op1]}    # Cycle 1
{"alu": [op2]}    # Cycle 2
{"alu": [op3]}    # Cycle 3
{"load": [load1]} # Cycle 4

Four cycles instead of one - and in each of those cycles, 11 of the 12 ALU slots (and every other engine) sit idle.
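
The first easy win is a packer that fills bundles up to the slot limits. The subtlety is dependencies: within a bundle, every op reads pre-cycle scratch state, so an op must not read an address that an earlier op in the same bundle writes. A greedy sketch of mine - the conflict test is supplied by the caller rather than derived by real dependency analysis:

def build_packed(slots, conflicts):
    # conflicts(bundle, engine, slot) -> True if `slot` reads anything
    # the ops already in `bundle` write this cycle.
    instrs, bundle = [], {}
    for engine, slot in slots:
        full = len(bundle.get(engine, [])) >= SLOT_LIMITS[engine]
        if bundle and (full or conflicts(bundle, engine, slot)):
            instrs.append(bundle)   # close this cycle, start the next
            bundle = {}
        bundle.setdefault(engine, []).append(slot)
    if bundle:
        instrs.append(bundle)
    return instrs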


Debugging Tools

Perfetto Trace Viewer

The simulator outputs Chrome Trace Event Format traces viewable in Perfetto:

python perf_takehome.py Tests.test_kernel_trace
python watch_trace.py  # Opens browser with live-reloading trace

The watch_trace.py server auto-reloads traces when they change - great for iterating.

Debug Instructions

Debug ops verify intermediate values without counting cycles:

body.append(("debug", ("compare", tmp_val, (round, i, "hashed_val"))))

Optimization Strategies

The key techniques:

1. VLIW Packing

Pack independent operations into the same cycle:

{"alu": [op1, op2, op3], "load": [load1, load2]}  # 5 ops, 1 cycle

2. SIMD Vectorization

Process 8 batch items at once with valu and vload/vstore.
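
Each hash stage vectorizes cleanly. The two inner ops both read only the current value, so they can share a cycle; the combine has to wait for the next one. A sketch for one stage over 8 items, with hypothetical scratch regions (C1 and C3 hold the stage's broadcast constants):

VAL, T1, T2, C1, C3 = 0, 8, 16, 24, 32   # hypothetical 8-wide scratch regions
op1, val1, op2, op3, val3 = HASH_STAGES[0]
stage = [
    {"valu": [(op1, T1, VAL, C1), (op3, T2, VAL, C3)]},  # t1 = a + const, t2 = a << 12
    {"valu": [(op2, VAL, T1, T2)]},                      # a = t1 + t2
]

That uses only 2 of the 6 VALU slots per cycle, leaving room to interleave stages from two more groups of 8 items.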

3. Software Pipelining

Overlap computation of different iterations to keep all engines busy.
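
For example, while the VALU slots hash one group of 8 items, the otherwise-idle load slots can fetch the tree nodes the next group will need. A schematic sketch - the scratch layout is hypothetical, and a real schedule would cover all 6 stages and many more groups:

A_VAL, A_T1, A_T2, C1, C3 = 0, 8, 16, 24, 32   # group A's 8-wide hash state
B_NODE, B_ADDR = 40, 48                        # group B's node destinations / pointers
op1, _, op2, op3, _ = HASH_STAGES[0]
pipeline = [
    {"valu": [(op1, A_T1, A_VAL, C1), (op3, A_T2, A_VAL, C3)],
     "load": [("load", B_NODE, B_ADDR), ("load", B_NODE + 1, B_ADDR + 1)]},
    {"valu": [(op2, A_VAL, A_T1, A_T2)],
     "load": [("load", B_NODE + 2, B_ADDR + 2), ("load", B_NODE + 3, B_ADDR + 3)]},
    # ...group B starts hashing as soon as its loads land, while A's next loads issue.
]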

4. Branchless with select

# Instead of conditional jumps, compute the offset arithmetically.
# Hypothetical scratch layout: ONE and TWO hold the constants 1 and 2.
("%",      BIT, VAL, TWO)        # alu:  bit = val % 2 (0 if even)
("select", OFF, BIT, TWO, ONE)   # flow: off = 2 if odd else 1
("<<",     TMP, IDX, ONE)        # alu:  tmp = 2*idx
("+",      IDX, TMP, OFF)        # alu:  idx = 2*idx + off

The Hard Part: Data Dependencies

You can’t compute the next tree index until you’ve hashed the current value. That’s a serial dependency chain. The trick is finding parallelism across different batch items.


Found this useful? Follow me on Twitter/X or GitHub.