Week 5 – Tiered Architecture and Asymmetric Discovery
Discovered that asymmetric bit allocation across frequency tiers dramatically improves language model performance. Built overnight sweep infrastructure.
Summary
The central discovery this week: not all tokens deserve the same architecture. Frequent tokens (which have abundant training data) benefit from more bits per neuron, while rare tokens (sparse data) need fewer bits to avoid empty memory cells.
Tiered Sparse Memory
Implemented a tiered architecture where the vocabulary is split into frequency-based tiers, each with its own neuron count and bit width:
Tier 0: 100 most frequent tokens (46% of data) → 15 neurons, 20 bits
Tier 1: 400 medium tokens (13% of data) → 10 neurons, 12 bits
Tier 2: 50K+ rare tokens (40% of data) → 5 neurons, 8 bits
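The tier lookup above can be sketched as a simple rank-to-config map. This is an illustrative sketch, not the actual implementation; the boundaries, neuron counts, and bit widths come from the list above, while the names and structure are assumptions.

```python
# Hypothetical tier table: (max_rank_exclusive, neurons, bits).
# Boundaries and widths taken from the tier list; structure is illustrative.
TIERS = [
    (100, 15, 20),          # Tier 0: 100 most frequent tokens
    (500, 10, 12),          # Tier 1: next 400 medium-frequency tokens
    (float("inf"), 5, 8),   # Tier 2: the 50K+ rare tail
]

def tier_config(freq_rank: int):
    """Map a token's frequency rank (0 = most frequent) to (tier, neurons, bits)."""
    for tier, (limit, neurons, bits) in enumerate(TIERS):
        if freq_rank < limit:
            return tier, neurons, bits

print(tier_config(3))      # a very frequent token lands in tier 0
print(tier_config(250))    # medium-frequency token -> tier 1
print(tier_config(40000))  # rare token -> tier 2
```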
The Asymmetric Insight
| Configuration | Test PPL |
|---|---|
| Asymmetric (20/12/8 bits) | 36,853 |
| Uniform (20/20/20 bits) | 49,675 |
The asymmetric config reduces test perplexity by ~26% (equivalently, the uniform config is ~35% worse). The reason is training-data density per address space:
- Tier 0 tokens have ~11,000 examples each → can fill 2^20 addresses
- Tier 2 tokens have ~20 examples each → 2^20 addresses are mostly EMPTY
When a neuron encounters an EMPTY address, it returns 0.5 (maximum entropy) — pure noise. Fewer bits = smaller address space = more cells actually trained.
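The density argument can be checked back-of-envelope: examples available per addressable memory cell, using the per-token counts quoted above. The function name is an assumption; the numbers are from the text.

```python
# Back-of-envelope check of training-data density per address space.
def fill_ratio(examples_per_token: int, bits: int) -> float:
    """Training examples available per addressable memory cell (2^bits cells)."""
    return examples_per_token / 2 ** bits

print(fill_ratio(11_000, 20))  # Tier 0 at 20 bits: ~0.01 examples per cell
print(fill_ratio(20, 20))      # Tier 2 at 20 bits: ~2e-5 -- almost every cell EMPTY
print(fill_ratio(20, 8))       # Tier 2 at 8 bits: ~0.08 -- thousands of times denser
```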
Overnight Sweep Infrastructure
Built tooling for systematic architecture exploration:
- Per-tier metrics (accuracy broken down by tier)
- Skip-completed experiments (resume interrupted sweeps)
- Validation PPL on held-out data
- JSON output for downstream analysis
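The skip-completed/resume behavior can be sketched as a loop that persists JSON results after every experiment and skips configs already on disk. The file layout, key scheme, and `run_experiment` callback are assumptions, not the actual tooling.

```python
import hashlib
import json
import os

def run_sweep(configs, results_path="sweep_results.json", run_experiment=None):
    """Run each config, persisting results as JSON; skip configs already completed."""
    results = {}
    if os.path.exists(results_path):
        with open(results_path) as f:
            results = json.load(f)  # resume an interrupted sweep
    for cfg in configs:
        key = hashlib.md5(cfg.encode()).hexdigest()
        if key in results:          # skip-completed experiment
            continue
        results[key] = {"config": cfg, "metrics": run_experiment(cfg)}
        with open(results_path, "w") as f:
            json.dump(results, f, indent=2)  # persist after every experiment
    return results
```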
Per-Cluster Optimization
Moved from global GA/TS (same config for all clusters) to per-cluster optimization. Each of the 50K+ clusters can independently have different neuron counts and bit widths. This created a much larger search space but allowed fine-grained adaptation.
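The shift from one shared config to independent per-cluster configs might look like the sketch below; the mutation operator and parameter ranges are illustrative assumptions, not the actual GA/TS code.

```python
import random

def init_per_cluster(n_clusters, neurons=8, bits=12):
    # Before: one (neurons, bits) pair shared by every cluster.
    # After: each cluster carries its own config, optimized independently.
    return {c: {"neurons": neurons, "bits": bits} for c in range(n_clusters)}

def mutate(configs, p=0.01, rng=random):
    """Perturb each cluster's config independently with probability p."""
    for cfg in configs.values():
        if rng.random() < p:
            cfg["neurons"] = max(1, cfg["neurons"] + rng.choice([-1, 1]))
        if rng.random() < p:
            cfg["bits"] = max(1, min(20, cfg["bits"] + rng.choice([-1, 1])))
    return configs
```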
Hybrid CPU+GPU Evaluation
Extended the Rust accelerator with a hybrid mode that splits work between CPU (rayon, 16 cores) and GPU (Metal, 40 cores) simultaneously, providing ~2x additional speedup.
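Conceptually, the hybrid mode splits each evaluation batch between the two backends in proportion to their throughput and runs both halves concurrently. The sketch below shows that idea in Python; the real implementation lives in the Rust accelerator, and the throughput weights and function names here are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_eval(batch, cpu_fn, gpu_fn, cpu_throughput=1.0, gpu_throughput=1.0):
    """Split a batch between CPU and GPU workers proportionally to throughput."""
    split = int(len(batch) * cpu_throughput / (cpu_throughput + gpu_throughput))
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(cpu_fn, batch[:split])  # CPU half runs...
        gpu_part = pool.submit(gpu_fn, batch[split:])  # ...while GPU half runs
        return cpu_part.result() + gpu_part.result()
```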
Best Tiered Results
The best tiered configuration was discovered during the overnight sweep (see Week 9):
5-tier with EMPTY=0.0:
Config: `50,15,20;50,13,18;400,9,10;20000,7,9;rest,5,8`
| Tier | Tokens | Neurons | Bits | Coverage |
|---|---|---|---|---|
| 0 | 50 most frequent | 15 | 20 | ~30% of data |
| 1 | Next 50 | 13 | 18 | ~8% of data |
| 2 | Next 400 | 9 | 10 | ~13% of data |
| 3 | Next 20,000 | 7 | 9 | ~30% of data |
| 4 | Remaining ~30K | 5 | 8 | ~19% of data |
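One reading of the config string above is "tokens,neurons,bits" per tier, `;`-separated, with `rest` absorbing the remaining vocabulary; the exact grammar is an assumption. Parsed against a ~50.5K vocabulary, it reproduces the table:

```python
def parse_tiers(spec: str, vocab_size: int):
    """Parse a 'tokens,neurons,bits;...' tier spec; 'rest' takes what remains."""
    tiers, used = [], 0
    for part in spec.split(";"):
        count, neurons, bits = part.split(",")
        n = vocab_size - used if count == "rest" else int(count)
        tiers.append({"tokens": n, "neurons": int(neurons), "bits": int(bits)})
        used += n
    return tiers

tiers = parse_tiers("50,15,20;50,13,18;400,9,10;20000,7,9;rest,5,8", 50_500)
print(tiers[0])   # {'tokens': 50, 'neurons': 15, 'bits': 20}
print(tiers[-1])  # {'tokens': 30000, 'neurons': 5, 'bits': 8}
```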
Key insight: EMPTY=0.0 (abstaining from vote) beats EMPTY=0.5 (maximum entropy) by 26% in CE. When a neuron encounters an untrained address, returning 0.0 effectively says “I have no opinion” — this lets other neurons dominate the prediction rather than adding noise.
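A toy illustration of the abstention effect, assuming votes are combined by averaging (the actual vote-combination rule is not spelled out here, and the numbers are made up): with EMPTY=0.5 the untrained neurons drag the score toward 0.5, while with EMPTY=0.0 (treated as abstaining) the trained neurons' signal survives intact.

```python
def vote(neuron_outputs, empty_value):
    """Average neuron votes; None marks an EMPTY (untrained) address.

    empty_value=0.0 is modeled as abstention: EMPTY neurons are excluded.
    empty_value=0.5 injects maximum-entropy noise into the average.
    """
    if empty_value == 0.0:
        trained = [p for p in neuron_outputs if p is not None]
        return sum(trained) / len(trained) if trained else 0.0
    return sum(empty_value if p is None else p for p in neuron_outputs) / len(neuron_outputs)

# Two trained neurons agree strongly; three hit EMPTY addresses.
outputs = [0.9, 0.8, None, None, None]
print(vote(outputs, empty_value=0.5))  # 0.64 -- noise drags the score toward 0.5
print(vote(outputs, empty_value=0.0))  # 0.85 -- abstention preserves the signal
```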
Final metrics: CE ~10.20, PPL ~26,986, Acc ~4.86%
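The two headline metrics are consistent with each other, assuming PPL = exp(CE) with CE in nats:

```python
import math

# exp(10.20) ~= 26,903, in line with the reported PPL ~26,986
# (the small gap is rounding in the CE figure).
print(math.exp(10.20))
```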
Next
The tier configuration space is large. Need a more structured search approach — which leads to the phased coarse-to-fine search developed next week.