Week 5 – Tiered Architecture and Asymmetric Discovery

Author

Luiz Garcia

Published

January 5, 2026

Abstract

Discovered that asymmetric bit allocation across frequency tiers dramatically improves language model performance. Built overnight sweep infrastructure.

Summary

The central discovery this week: not all tokens deserve the same architecture. Frequent tokens (which have abundant training data) benefit from more bits per neuron, while rare tokens (sparse data) need fewer bits to avoid empty memory cells.

Tiered Sparse Memory

Implemented a tiered architecture where the vocabulary is split into frequency-based tiers, each with its own neuron count and bit width:

Tier 0: 100 most frequent tokens  (46% of data)  → 15 neurons, 20 bits
Tier 1: 400 medium tokens         (13% of data)  → 10 neurons, 12 bits
Tier 2: 50K+ rare tokens          (40% of data)  →  5 neurons,  8 bits
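As a sketch, the tier layout above can be expressed as a rank-to-configuration lookup, assuming the vocabulary is sorted by descending frequency (names and structure here are illustrative, not the actual implementation):

```python
# Tier layout from the table above: (max_rank_exclusive, neurons, bits).
# Token ranks are 0-based positions in a frequency-sorted vocabulary.
TIERS = [
    (100, 15, 20),         # Tier 0: 100 most frequent tokens
    (500, 10, 12),         # Tier 1: next 400 tokens
    (float("inf"), 5, 8),  # Tier 2: everything rarer
]

def tier_config(rank: int) -> tuple[int, int]:
    """Return (neurons, bits) for a token's frequency rank."""
    for max_rank, neurons, bits in TIERS:
        if rank < max_rank:
            return neurons, bits
    raise ValueError("unreachable")
```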

The Asymmetric Insight

Configuration               Test PPL
Asymmetric (20/12/8 bits)     36,853
Uniform (20/20/20 bits)       49,675

The asymmetric config cuts perplexity by roughly 26% relative to uniform (equivalently, the uniform config's PPL is ~35% higher). The reason is training-data density per address space:

  • Tier 0 tokens have ~11,000 examples each → enough to populate a meaningful share of a 2^20 address space
  • Tier 2 tokens have ~20 examples each → a 2^20 address space stays almost entirely EMPTY

When a neuron encounters an EMPTY address, it returns 0.5 (maximum entropy) — pure noise. Fewer bits = smaller address space = more cells actually trained.
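The density argument can be made concrete. Assuming each training example writes one uniformly hashed address per neuron (an idealization), the expected fraction of trained cells is 1 − (1 − 2⁻ᵇ)ⁿ for n examples and b bits:

```python
def trained_fraction(examples: int, bits: int) -> float:
    """Expected fraction of the 2**bits cells that receive at least one
    training example, assuming uniform independent hashing."""
    cells = 2 ** bits
    return 1.0 - (1.0 - 1.0 / cells) ** examples

# A Tier 2 token (~20 examples): 20 bits leaves the memory essentially
# all EMPTY, while 8 bits trains a meaningful fraction of cells.
print(trained_fraction(20, 20))  # ~0.002% of cells trained
print(trained_fraction(20, 8))   # ~7.5% of cells trained
```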

Overnight Sweep Infrastructure

Built tooling for systematic architecture exploration:

  • Per-tier metrics (accuracy broken down by tier)
  • Skip-completed experiments (resume interrupted sweeps)
  • Validation PPL on held-out data
  • JSON output for downstream analysis
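A minimal sketch of the sweep driver combining the last three bullets — skip-completed resume plus JSON output (the function names, result schema, and directory layout are hypothetical):

```python
import json
from pathlib import Path

def run_sweep(configs, evaluate, out_dir="sweep_results"):
    """Evaluate each config, skipping any that already has a result file,
    so an interrupted overnight sweep can be resumed."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for cfg in configs:
        # One JSON file per config; its existence marks the experiment done.
        path = out / (cfg["name"] + ".json")
        if path.exists():
            continue  # skip-completed: resume after interruption
        metrics = evaluate(cfg)  # e.g. {"val_ppl": ..., "per_tier_acc": [...]}
        path.write_text(json.dumps({"config": cfg, "metrics": metrics}))
```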

Per-Cluster Optimization

Moved from global GA/TS (same config for all clusters) to per-cluster optimization. Each of the 50K+ clusters can independently have different neuron counts and bit widths. This created a much larger search space but allowed fine-grained adaptation.
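To illustrate the enlarged search space: each cluster carries its own (neurons, bits) gene, and a GA-style mutation step perturbs clusters independently. A sketch under assumed choice grids and mutation rate (both illustrative, not the actual values):

```python
import random

NEURON_CHOICES = (5, 7, 9, 13, 15)    # illustrative grid, not the real one
BIT_CHOICES = (8, 9, 10, 12, 18, 20)

def mutate(genome: dict, rate: float = 0.01) -> dict:
    """One mutation pass over a per-cluster genome mapping
    cluster_id -> (neurons, bits). Each cluster mutates independently,
    which is what makes the per-cluster search space so large."""
    return {
        cid: (random.choice(NEURON_CHOICES), random.choice(BIT_CHOICES))
        if random.random() < rate else cfg
        for cid, cfg in genome.items()
    }
```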

Hybrid CPU+GPU Evaluation

Extended the Rust accelerator with a hybrid mode that splits work between CPU (rayon, 16 cores) and GPU (Metal, 40 cores) simultaneously, providing ~2x additional speedup.
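The actual hybrid mode lives in the Rust accelerator (rayon on CPU, Metal on GPU); here is only a Python sketch of the splitting idea, with the GPU fraction as a tunable parameter (names and the 0.6 default are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_eval(batch, cpu_eval, gpu_eval, gpu_share=0.6):
    """Split a batch between CPU and GPU workers running concurrently.
    gpu_share is the fraction routed to the GPU, tuned so both
    devices finish at roughly the same time."""
    split = int(len(batch) * gpu_share)
    gpu_part, cpu_part = batch[:split], batch[split:]
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_future = pool.submit(gpu_eval, gpu_part)
        cpu_future = pool.submit(cpu_eval, cpu_part)
        # Reassemble results in original batch order.
        return gpu_future.result() + cpu_future.result()
```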

Best Tiered Results

The best tiered configuration was discovered during the overnight sweep (see Week 9):

5-tier with EMPTY=0.0:

Config: 50,15,20;50,13,18;400,9,10;20000,7,9;rest,5,8
Tier  Tokens            Neurons  Bits  Coverage
0     50 most frequent       15    20  ~30% of data
1     Next 50                13    18   ~8% of data
2     Next 400                9    10  ~13% of data
3     Next 20,000             7     9  ~30% of data
4     Remaining ~30K          5     8  ~19% of data

Key insight: EMPTY=0.0 (abstaining from vote) beats EMPTY=0.5 (maximum entropy) by 26% in CE. When a neuron encounters an untrained address, returning 0.0 effectively says “I have no opinion” — this lets other neurons dominate the prediction rather than adding noise.
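A toy sketch of why abstention helps (the scores and layout are illustrative): with EMPTY=0.5, a candidate whose cells are mostly untrained receives a constant spurious boost that can outvote a genuinely trained candidate, while EMPTY=0.0 lets the trained votes decide.

```python
def cluster_score(cell_values, empty_value=0.0):
    """Aggregate neuron votes for one candidate cluster.
    cell_values holds per-neuron memory reads; None marks an
    untrained (EMPTY) cell."""
    return sum(empty_value if v is None else v for v in cell_values)

# Candidate A: all five cells trained, one strong vote.
# Candidate B: one weak vote, four EMPTY cells.
a = [0.9, 0.1, 0.1, 0.1, 0.1]
b = [0.3, None, None, None, None]

print(cluster_score(a, 0.0), cluster_score(b, 0.0))  # 1.3 vs 0.3 -> A wins
print(cluster_score(a, 0.5), cluster_score(b, 0.5))  # 1.3 vs 2.3 -> B wins on noise
```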

Final metrics: CE ~10.20, PPL ~26,986, Acc ~4.86%

Next

The tier configuration space is large. Need a more structured search approach — which leads to the phased coarse-to-fine search developed next week.

Reuse

CC-BY-NC-SA-4.0