Week 10 – Bitwise Architecture and Plug-and-Play Gating
Introduced BitwiseRAMLM — a per-bit output language model with only 16 clusters that outperforms the 50K-cluster tiered architecture. Plus plug-and-play gating that works across both architectures.
Summary
The most significant architectural breakthrough of the project. Instead of 50K+ clusters (one per token), the BitwiseRAMLM uses just 16 clusters — one per output bit. This simple change addresses the fundamental data density problem that plagued the tiered architecture.
BitwiseRAMLM
The core insight: rather than predicting “which token comes next” directly (50K-way classification), predict each bit of the token’s binary encoding independently:
\[\log P(\text{token}=t) = \sum_{i=0}^{15} \left[ b_i(t) \cdot \log P_i + (1-b_i(t)) \cdot \log(1-P_i) \right]\]
where \(b_i(t)\) is the \(i\)-th bit of token \(t\)’s binary encoding and \(P_i = P(\text{bit}_i = 1 \mid \text{context})\).
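The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the LSB-first bit layout and the function name are assumptions.

```python
import math

def token_log_prob(token_id: int, bit_probs, n_bits: int = 16) -> float:
    """Score a token by summing its per-bit log-probabilities.

    bit_probs[i] = P(bit_i = 1 | context), one entry per bit cluster.
    The LSB-first bit layout is an assumption for illustration.
    """
    eps = 1e-9  # guard against log(0) from saturated bit predictions
    log_p = 0.0
    for i in range(n_bits):
        b = (token_id >> i) & 1                      # i-th bit of the token's encoding
        p = min(max(bit_probs[i], eps), 1 - eps)
        log_p += math.log(p) if b else math.log(1 - p)
    return log_p
```

In practice this sum is computed for the whole vocabulary at once as a (16 x 50K) reconstruction matmul, which is what the Metal CE kernel described below accelerates.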
Why this works better:
| | Tiered (50K clusters) | Bitwise (16 clusters) |
|---|---|---|
| Clusters | 50,257 | 16 |
| Training examples per cluster | ~20 (rare tokens) | ~150,000 (ALL examples) |
| Data density | Severely sparse for rare tokens | Every neuron sees everything |
| Address space utilization | Many EMPTY cells | Dense training |
4-State Memory Modes
Introduced 4-state (QUAD) memory modes for BitwiseRAMLM, alongside the original ternary mode:
- TERNARY (mode 0): Original 3-state (FALSE/TRUE/EMPTY), majority vote
- QUAD_BINARY (mode 1): 4-state nudging with binary threshold (cell >= 2 means true)
- QUAD_WEIGHTED (mode 2): 4-state nudging with weighted confidence
The 4-state modes handle contradictory training examples gracefully — instead of last-writer-wins semantics, cells accumulate evidence.
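A 4-state cell behaves like a 2-bit saturating counter. The sketch below shows one plausible nudging and readout scheme; the exact update and confidence-mapping rules are assumptions, not the project's code.

```python
def nudge(cell: int, target_bit: int) -> int:
    """Move a 2-bit cell (0..3) one step toward the observed bit.

    Contradictory writes shift the cell gradually instead of
    overwriting it, so the cell accumulates evidence over examples.
    """
    if target_bit:
        return min(cell + 1, 3)
    return max(cell - 1, 0)

def read_binary(cell: int) -> int:
    """QUAD_BINARY readout: cell >= 2 means TRUE."""
    return 1 if cell >= 2 else 0

def read_weighted(cell: int) -> float:
    """QUAD_WEIGHTED readout: map the 4 states to a confidence in [0, 1]."""
    return cell / 3.0
```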
7-Phase Optimization Pipeline
Built a complete optimization pipeline for BitwiseRAMLM:
- Grid search (neurons x bits landscape scan)
- GA neurons (bits fixed from grid search)
- TS neurons refinement
- GA bits (neurons fixed)
- TS bits refinement
- GA connections (architecture fixed)
- TS connections refinement
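The pipeline alternates global (GA) and local (TS) search over one parameter group at a time. A hypothetical phase driver, with illustrative names only:

```python
# Each phase optimizes one parameter group while the others stay fixed.
PHASES = [
    ("grid", "neurons+bits"),
    ("ga",   "neurons"),
    ("ts",   "neurons"),
    ("ga",   "bits"),
    ("ts",   "bits"),
    ("ga",   "connections"),
    ("ts",   "connections"),
]

def run_pipeline(evaluate, optimizers, config):
    """Run the 7 phases, threading the best config from one into the next."""
    best = config
    for method, target in PHASES:
        best = optimizers[method](best, target, evaluate)
    return best
```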
Neuron-Parallel Training + Metal CE
The Rust accelerator was extended for bitwise:
- Neuron-parallel training: Each of 16 clusters trains independently, enabling massive parallelism
- Metal GPU CE: The reconstruction matmul (16 bits x 50K tokens) runs on Metal
- 23x speedup for the full train+eval pipeline
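The key property enabling neuron-parallel training is that the 16 bit clusters share no state: each one sees all examples, labeled with its own bit of the target token. A Python-level sketch of that decomposition (the per-cluster trainer is hypothetical; the real work happens in the Rust accelerator):

```python
from concurrent.futures import ThreadPoolExecutor

def bit_labels(targets, bit_index):
    """Project token targets onto one bit cluster's 0/1 training labels."""
    return [(t >> bit_index) & 1 for t in targets]

def train_all_clusters(contexts, targets, n_bits=16, train_one=None):
    """Train the 16 bit clusters independently and in parallel.

    train_one(contexts, labels) stands in for a per-cluster trainer;
    since the clusters share no state, they parallelize trivially.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(train_one, contexts, bit_labels(targets, i))
                   for i in range(n_bits)]
        return [f.result() for f in futures]
```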
Full-Rust BitwiseEvaluator
The BitwiseEvaluator supports heterogeneous per-cluster configs (different neurons and bits per cluster) with automatic Rust+Metal batch evaluation. 50 genomes are evaluated in parallel.
Plug-and-Play Gating
Extracted gating into a standalone GatingTrainer that works with both architectures:
- GatingMode.TOKEN_LEVEL: Universal, vocab_size gates (works with any architecture)
- GatingMode.BIT_LEVEL: Bitwise-specific, 16 gates that confidence-weight bit predictions
- GatingMode.DUAL_STAGE: Both — bit gating then token gating
The trainer is architecture-agnostic: pass cluster_order for tiered encoding, omit it for bitwise.
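For intuition on BIT_LEVEL gating, here is one way per-bit confidence weighting could work. The interpolation rule is an assumption for illustration, not the GatingTrainer's actual formula:

```python
def apply_bit_gates(bit_probs, gates):
    """Interpolate each bit cluster's prediction toward a 50/50 prior
    according to a learned per-bit confidence gate in [0, 1].

    gates[i] = 1 trusts cluster i fully; gates[i] = 0 ignores it,
    falling back to the uninformative prior.
    """
    return [g * p + (1.0 - g) * 0.5 for p, g in zip(bit_probs, gates)]
```

DUAL_STAGE would apply a transform like this to the 16 bit probabilities first, then apply vocab_size token-level gates to the reconstructed token distribution.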
Routing Experiments
Also explored routing-based approaches:
- Deterministic routing using input-observable features
- Selective expert evaluation (4x speedup by skipping irrelevant experts)
- Input feature analysis for optimal routing strategies
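Deterministic routing on input-observable features can be as simple as hashing a feature of the context to an expert index, so only that expert is evaluated. A minimal sketch (the feature choice is an assumption):

```python
def route(context_tokens, n_experts=4):
    """Pick one expert from an input-observable feature (here, the
    last token id). Because routing depends only on the input, it is
    deterministic, and skipping the other experts gives the speedup."""
    return context_tokens[-1] % n_experts
```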
Refactored GA/TS
Major cleanup of the optimization infrastructure:
- OptimizationConfig base class shared by GA and TS
- Pluggable hooks for monitoring and intervention
- Unified optimize() loops
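The shape of that refactor might look like the following. All names here are illustrative, not the project's real API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class OptimizationConfig:
    """Configuration shared by GA and TS (illustrative fields only)."""
    max_iters: int = 100
    on_iteration: List[Callable] = field(default_factory=list)  # pluggable hooks

def optimize(config, propose, evaluate, initial):
    """Unified loop: GA and TS differ only in the `propose` step;
    hooks observe every iteration for monitoring or intervention."""
    best, best_score = initial, evaluate(initial)
    for it in range(config.max_iters):
        candidate = propose(best)
        score = evaluate(candidate)
        if score < best_score:  # minimizing CE
            best, best_score = candidate, score
        for hook in config.on_iteration:
            hook(it, best, best_score)
    return best, best_score
```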
Optimization Progression
The bitwise architecture was optimized through a multi-phase pipeline, starting from grid search and refining with metaheuristics.
Phase 1 — Grid search (top 5 configurations):
| Neurons | Bits | Memory Mode | CE | PPL | Acc |
|---|---|---|---|---|---|
| 7 | 14 | QUAD_WEIGHTED | 9.14 | 9,344 | 6.42% |
| 7 | 13 | QUAD_WEIGHTED | 9.16 | 9,541 | 6.38% |
| 8 | 14 | QUAD_WEIGHTED | 9.18 | 9,709 | 6.31% |
| 7 | 15 | QUAD_WEIGHTED | 9.19 | 9,834 | 6.28% |
| 6 | 14 | QUAD_WEIGHTED | 9.21 | 9,987 | 6.19% |
Phase progression:
| Phase | Method | Target | CE | PPL |
|---|---|---|---|---|
| 1 | Grid search | neurons × bits | 9.14 | 9,344 |
| 2 | GA neurons | bits fixed at 14 | 9.13 | 9,270 |
| 3 | TS connections | architecture fixed | 9.12 | 9,150 |
| 4 | GA bits | neurons fixed | 9.11 | ~9,033 |
Each phase narrows the search space: grid search finds the landscape, GA explores globally, TS refines locally.
Results
The bitwise architecture with QUAD_WEIGHTED memory is currently the best performer:
- CE ~9.11 (vs ~10.3 for best tiered) — from v3 GA bits optimization
- PPL ~9,000 (vs ~30,000 for best tiered)
- Accuracy ~6.4% (grid search baseline; drops to ~3.8% at best CE due to CE/accuracy tradeoff)
Still far from GPT-2’s ~3.4 CE / ~29 PPL, but the 3x PPL improvement over tiered (9K vs 30K) confirms that data density is the critical bottleneck for RAM-based language models.
Next
Continue optimizing the bitwise architecture. The gating mechanism is now ready for systematic testing across all three modes. The DUAL_STAGE mode (bit gating + token gating) is the most promising unexplored direction.