Today, Anthropic open-sourced their original performance engineering take-home. The task: optimize a kernel running on a custom VLIW SIMD processor simulator. The baseline takes 147,734 cycles. Claude Opus 4.5 got it down to 1,487 cycles - a 99x speedup that beat most humans.
I’m Tristan (@trirpi), and I work on AI kernels. Let’s break down how this whole system works.
The Architecture at a Glance
This is a VLIW (Very Long Instruction Word) SIMD (Single Instruction Multiple Data) processor with a single core (older versions of the take-home had multiple cores). Let me break down what that means.
VLIW: Compiler-Scheduled Parallelism
In a traditional processor, hardware figures out at runtime which instructions can run in parallel. In a VLIW processor, that job shifts to the compiler (or in this case, you).
The single core has multiple functional units that can all execute simultaneously:
| Unit | Count | Operations |
|---|---|---|
| ALU | 12 | Scalar: +, -, *, //, ^, &, \|, <<, >>, %, <, == |
| VALU | 6 | Vector (8 elements): same ops as ALU |
| LOAD | 2 | load, vload (8 words), const |
| STORE | 2 | store, vstore (8 words) |
| FLOW | 1 | select, vselect, jump, cond_jump, halt |
You pack operations into instruction bundles. Each cycle, the processor executes one bundle, dispatching operations to all the units in parallel. If you only put one operation in a bundle, the other units sit idle. That’s why the baseline is so slow.
Example bundle (executes in 1 cycle):
{"alu": [op1, op2, op3], "valu": [vop1, vop2], "load": [ld1, ld2]}
With 12 ALUs and 6 VALUs (each processing 8 elements), this single core can theoretically do 12 + 6×8 = 60 arithmetic operations per cycle.
Memory Hierarchy
flowchart LR
subgraph mem["💾 MAIN MEMORY"]
DATA["Problem Data
(tree, indices, values)"]
end
subgraph scratch["📦 SCRATCH SPACE (1536 words)"]
REG["Works like registers
All ALU ops read/write here"]
end
mem <-->|"LOAD/STORE
⚠️ 2 each per cycle"| scratch
- Main Memory: Where the problem data lives. ALU/VALU can’t access it directly.
- Scratch Space: 1536 words of fast storage. All compute operations read/write scratch addresses.
- Bottleneck: Only 2 loads and 2 stores per cycle. This is often the limiting factor, not compute.
The Execution Engines
The processor has multiple engines, each capable of executing multiple slots per cycle. From problem.py:
SLOT_LIMITS = {
"alu": 12, # 12 scalar ALU operations per cycle
"valu": 6, # 6 vector ALU operations per cycle
"load": 2, # 2 load operations per cycle
"store": 2, # 2 store operations per cycle
"flow": 1, # 1 flow control operation per cycle
"debug": 64, # Debug operations (not counted)
}
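Since SLOT_LIMITS is just a dict, a small helper can reject an over-packed bundle before it ever reaches the simulator. This is a sketch, not part of the take-home; validate_bundle is a hypothetical name.

```python
# Hypothetical helper (not in the take-home): check a bundle against the
# per-engine slot limits before submitting it.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1, "debug": 64}

def validate_bundle(bundle: dict) -> None:
    for engine, slots in bundle.items():
        limit = SLOT_LIMITS.get(engine)
        if limit is None:
            raise ValueError(f"unknown engine: {engine!r}")
        if len(slots) > limit:
            raise ValueError(f"{engine}: {len(slots)} slots, limit is {limit}")

# A full ALU bundle plus a load is legal; a 13th ALU op would raise.
validate_bundle({"alu": [("+", 0, 1, 2)] * 12, "load": [("load", 3, 4)]})
```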
What an Instruction Bundle Looks Like
flowchart LR
subgraph bundle["📦 Instruction Bundle (1 clock cycle)"]
subgraph compute["Compute"]
ALU["alu:
('+', dest, a, b)
('-', dest, a, b)
('*', dest, a, b)
...up to 12"]
VALU["valu:
('*', vdest, va, vb)
('+', vdest, va, vb)
...up to 6"]
end
subgraph memory["Memory"]
LOAD["load:
('load', dest, addr)
('vload', vdest, addr)"]
STORE["store:
('store', addr, src)
('vstore', addr, vsrc)"]
end
subgraph control["Control"]
FLOW["flow:
('select', d, c, a, b)"]
DEBUG["debug:
('compare', loc, key)
(not counted)"]
end
end
An instruction is a Python dict mapping engine names to lists of operations. Here’s a real example:
{"valu": [("*", 4, 0, 0), ("+", 8, 4, 0)], "load": [("load", 16, 17)]}
This executes three operations in one cycle:
- Vector multiply: scratch[4:12] = scratch[0:8] * scratch[0:8]
- Vector add: scratch[8:16] = scratch[4:12] + scratch[0:8]
- Scalar load: scratch[16] = memory[scratch[17]]
Note that, as the engine code below shows, writes go to a separate scratch_write buffer and only become visible after the cycle, so the vector add reads the contents of scratch[4:12] from before this bundle ran.
The Problem: Batched Tree Traversal
The kernel implements a batched tree traversal with hashing. Here’s the flow:
flowchart LR
subgraph rounds["🔄 16 Rounds"]
R0["Round 0"] --> R1["Round 1"] --> R2["Round 2"] --> RN["..."]
end
subgraph batch["📊 Batch of 256 items"]
B0["Item 0"]
B1["Item 1"]
B2["Item 2"]
BN["..."]
end
subgraph ALGO["⚙️ Per-item computation"]
A1["idx = indices[i]"] --> A2["val = values[i]"]
A2 --> A3["node_val = tree[idx]"]
A3 --> A4["val = hash(val ^ node_val)"]
A4 --> A5["idx = 2*idx + (1 if even else 2)"]
A5 --> A6["if idx >= n_nodes: idx = 0"]
end
rounds --> batch
batch --> ALGO
From the reference kernel:
def reference_kernel(t: Tree, inp: Input):
"""
A parallel tree traversal where at each node we set
cur_inp_val = myhash(cur_inp_val ^ node_val)
and then choose the left branch if cur_inp_val is even.
If we reach the bottom of the tree we wrap around to the top.
"""
for h in range(inp.rounds):
for i in range(len(inp.indices)):
idx = inp.indices[i]
val = inp.values[i]
val = myhash(val ^ t.values[idx])
idx = 2 * idx + (1 if val % 2 == 0 else 2)
idx = 0 if idx >= len(t.values) else idx
inp.values[i] = val
inp.indices[i] = idx
Test configuration:
- Tree height: 10 (2047 nodes in a perfect binary tree)
- Batch size: 256 items processed
- Rounds: 16 iterations
That’s 256 × 16 = 4096 traversal steps, each involving a hash computation.
The Hash Function
The hash runs 6 stages, each doing a = (a op1 const) op2 (a op3 shift):
| Stage | Formula |
|---|---|
| 0 | a = (a + 0x7ED55D16) + (a << 12) |
| 1 | a = (a ^ 0xC761C23C) ^ (a >> 19) |
| 2 | a = (a + 0x165667B1) + (a << 5) |
| 3 | a = (a + 0xD3A2646C) ^ (a << 9) |
| 4 | a = (a + 0xFD7046C5) + (a << 3) |
| 5 | a = (a ^ 0xB55A4F09) ^ (a >> 16) |
Each stage = 3 ALU ops. Total: 6 × 3 = 18 ALU operations per hash.
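Those numbers give a quick lower bound on the achievable cycle count: total hash ops divided by the 60 arithmetic ops per cycle from the table above.

```python
# Back-of-envelope lower bound, ignoring loads/stores and index updates:
# total hash ops divided by peak arithmetic throughput.
steps = 256 * 16        # batch size x rounds
ops_per_hash = 6 * 3    # 6 stages x 3 ALU ops each
peak = 12 + 6 * 8       # 12 scalar ALUs + 6 VALUs x 8 lanes
print(steps * ops_per_hash / peak)  # 1228.8
```

The 1,487-cycle solution sits remarkably close to this compute-only bound.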
The hash is defined as data, which makes it straightforward to generate kernel code for it:
HASH_STAGES = [
("+", 0x7ED55D16, "+", "<<", 12),
("^", 0xC761C23C, "^", ">>", 19),
("+", 0x165667B1, "+", "<<", 5),
("+", 0xD3A2646C, "^", "<<", 9),
("+", 0xFD7046C5, "+", "<<", 3),
("^", 0xB55A4F09, "^", ">>", 16),
]
Similar to Bob Jenkins’ hash. Each stage: a = (a op1 val1) op2 (a op3 val3)
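For reference, here is a plain-Python rendering of that stage table (a sketch; the take-home defines its own myhash, but it follows exactly this structure):

```python
MASK = (1 << 32) - 1  # all arithmetic wraps to 32 bits

HASH_STAGES = [
    ("+", 0x7ED55D16, "+", "<<", 12),
    ("^", 0xC761C23C, "^", ">>", 19),
    ("+", 0x165667B1, "+", "<<", 5),
    ("+", 0xD3A2646C, "^", "<<", 9),
    ("+", 0xFD7046C5, "+", "<<", 3),
    ("^", 0xB55A4F09, "^", ">>", 16),
]

OPS = {
    "+": lambda a, b: (a + b) & MASK,
    "^": lambda a, b: a ^ b,
    "<<": lambda a, b: (a << b) & MASK,
    ">>": lambda a, b: a >> b,
}

def myhash(a: int) -> int:
    for op1, val1, op2, op3, val3 in HASH_STAGES:
        # both operands use the value of a from before this stage
        a = OPS[op2](OPS[op1](a, val1), OPS[op3](a, val3))
    return a
```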
The Memory Model
Two memory spaces:
- Main Memory (self.mem): problem input/output
- Scratch Space (core.scratch): 1536 words - think of it as registers + constant memory + manually managed cache
From the constants:
VLEN = 8 # Vector length: 8 elements
N_CORES = 1 # Single core (older versions had multiple)
SCRATCH_SIZE = 1536 # 1536 words of scratch space
Every ALU operation reads and writes scratch addresses. It’s like programming a GPU with shared memory, but more explicit.
The ALU Operations
def alu(self, core, op, dest, a1, a2):
a1 = core.scratch[a1]
a2 = core.scratch[a2]
match op:
case "+": res = a1 + a2
case "-": res = a1 - a2
case "*": res = a1 * a2
case "//": res = a1 // a2
case "^": res = a1 ^ a2 # XOR
case "&": res = a1 & a2 # AND
case "|": res = a1 | a2 # OR
case "<<": res = a1 << a2 # Left shift
case ">>": res = a1 >> a2 # Right shift
case "%": res = a1 % a2 # Modulo
case "<": res = int(a1 < a2) # Comparison
case "==": res = int(a1 == a2)
res = res % (2**32) # 32-bit unsigned wrap
self.scratch_write[dest] = res
Vector ALU ops - the SIMD part:
def valu(self, core, *slot):
match slot:
case ("vbroadcast", dest, src):
# Broadcast scalar to all 8 vector lanes
for i in range(VLEN):
self.scratch_write[dest + i] = core.scratch[src]
case ("multiply_add", dest, a, b, c):
# Fused multiply-add: dest = a * b + c
for i in range(VLEN):
mul = (core.scratch[a + i] * core.scratch[b + i]) % (2**32)
self.scratch_write[dest + i] = (mul + core.scratch[c + i]) % (2**32)
case (op, dest, a1, a2):
# Any scalar op applied element-wise
for i in range(VLEN):
self.alu(core, op, dest + i, a1 + i, a2 + i)
Memory Operations
Load/store - only 2 of each per cycle:
def load(self, core, *slot):
match slot:
case ("load", dest, addr):
self.scratch_write[dest] = self.mem[core.scratch[addr]]
case ("vload", dest, addr): # 8 consecutive elements
addr = core.scratch[addr]
for vi in range(VLEN):
self.scratch_write[dest + vi] = self.mem[addr + vi]
case ("const", dest, val):
self.scratch_write[dest] = (val) % (2**32)
Key bottleneck: only 2 loads per cycle. Vector loads (vload) help - 8 elements in one slot!
Flow Control
Flow ops - crucial for branchless programming:
def flow(self, core, *slot):
match slot:
case ("select", dest, cond, a, b):
# Branchless: dest = cond ? a : b
self.scratch_write[dest] = (
core.scratch[a] if core.scratch[cond] != 0 else core.scratch[b]
)
case ("vselect", dest, cond, a, b):
# Vector version
for vi in range(VLEN):
self.scratch_write[dest + vi] = (
core.scratch[a + vi]
if core.scratch[cond + vi] != 0
else core.scratch[b + vi]
)
case ("cond_jump", cond, addr):
if core.scratch[cond] != 0:
core.pc = addr
case ("jump", addr):
core.pc = addr
Why the Baseline Is So Slow
The baseline kernel deliberately uses one operation per cycle:
def build(self, slots: list[tuple[Engine, tuple]], vliw: bool = False):
# Simple slot packing that just uses one slot per instruction bundle
instrs = []
for engine, slot in slots:
instrs.append({engine: [slot]}) # One op per bundle!
return instrs
So instead of:
{"alu": [op1, op2, op3], "load": [load1]} # 1 cycle
You get:
{"alu": [op1]} # Cycle 1
{"alu": [op2]} # Cycle 2
{"alu": [op3]} # Cycle 3
{"load": [load1]} # Cycle 4
4 cycles instead of 1, and in each ALU cycle 11 of the 12 ALU slots sit empty.
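A minimal greedy packer sketches the obvious first fix (illustrative only; a real packer must also track data dependencies between slots, which this one assumes away):

```python
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1, "debug": 64}

def pack(slots):
    # Greedily fill one bundle until some engine runs out of slots, then
    # start a new bundle. Assumes all slots are mutually independent.
    instrs, cur = [], {}
    for engine, slot in slots:
        if len(cur.get(engine, [])) >= SLOT_LIMITS[engine]:
            instrs.append(cur)
            cur = {}
        cur.setdefault(engine, []).append(slot)
    if cur:
        instrs.append(cur)
    return instrs

# 3 ALU ops + 1 load now share a single cycle instead of taking 4.
print(len(pack([("alu", ("+", 0, 1, 2))] * 3 + [("load", ("load", 16, 17))])))  # 1
```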
Debugging Tools
Perfetto Trace Viewer
The simulator outputs Chrome Trace Event Format traces viewable in Perfetto:
python perf_takehome.py Tests.test_kernel_trace
python watch_trace.py # Opens browser with live-reloading trace
The watch_trace.py server auto-reloads traces when they change - great for iterating.
Debug Instructions
Debug ops verify intermediate values without counting cycles:
body.append(("debug", ("compare", tmp_val, (round, i, "hashed_val"))))
Optimization Strategies
The key techniques:
1. VLIW Packing
Pack independent operations into the same cycle:
{"alu": [op1, op2, op3], "load": [load1, load2]} # 5 ops, 1 cycle
2. SIMD Vectorization
Process 8 batch items at once with valu and vload/vstore.
3. Software Pipelining
Overlap computation of different iterations to keep all engines busy.
4. Branchless with select
# Instead of conditional jumps:
offset = select(val % 2 == 0, 1, 2)
idx = 2*idx + offset
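That pseudocode can be spelled out as simulator-style ops and run through a tiny scalar interpreter (illustrative, not the take-home's simulator; the scratch layout is an assumption: [0]=val, [1]=idx, constants at [10]=2, [11]=1, [12]=0, [13]=n_nodes):

```python
def index_update(scratch):
    # Branchless: idx = 2*idx + (1 if val even else 2), wrapping to 0 past
    # the bottom of the tree. Ops run sequentially here, one per "cycle".
    ops = [
        ("%", 2, 0, 10),            # scratch[2] = val % 2
        ("==", 3, 2, 12),           # scratch[3] = 1 if val is even else 0
        ("select", 4, 3, 11, 10),   # offset = 1 if even else 2
        ("*", 5, 1, 10),            # scratch[5] = 2 * idx
        ("+", 1, 5, 4),             # idx = 2*idx + offset
        ("<", 6, 1, 13),            # in_range = idx < n_nodes
        ("select", 1, 6, 1, 12),    # idx = idx if in_range else 0
    ]
    fns = {"%": lambda x, y: x % y, "==": lambda x, y: int(x == y),
           "*": lambda x, y: x * y, "+": lambda x, y: x + y,
           "<": lambda x, y: int(x < y)}
    for op, *args in ops:
        if op == "select":
            d, c, a, b = args
            scratch[d] = scratch[a] if scratch[c] != 0 else scratch[b]
        else:
            d, a, b = args
            scratch[d] = fns[op](scratch[a], scratch[b])
    return scratch[1]

scratch = [0] * 16
scratch[0], scratch[1] = 4, 3        # val=4 (even), idx=3
scratch[10:14] = [2, 1, 0, 2047]     # constants, n_nodes=2047
print(index_update(scratch))         # 2*3 + 1 = 7
```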
The Hard Part: Data Dependencies
You can’t compute the next tree index until you’ve hashed the current value. That’s a serial dependency chain. The trick is finding parallelism across different batch items.
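A toy cycle-count model makes the point (illustrative numbers only): within one item the hash stages serialize, but independent items can advance through their stages together, "width" of them per cycle.

```python
import math

def serial_cycles(n_items, n_stages):
    # One item at a time: every stage waits on the previous one.
    return n_items * n_stages

def pipelined_cycles(n_items, n_stages, width):
    # 'width' independent items advance one stage together each cycle.
    return math.ceil(n_items / width) * n_stages

print(serial_cycles(256, 6))        # 1536
print(pipelined_cycles(256, 6, 8))  # 192  (8 items per vector)
```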
Further Reading
- VLIW Architecture - Wikipedia
- SIMD Programming - Intel Intrinsics Guide
- Software Pipelining - Wikipedia
- Perfetto UI
- Chrome Trace Event Format
- Computer Architecture: A Quantitative Approach
- Branchless Programming