Week 4 – RAM Transformer and the Big Sprint
The most productive week yet. Built the complete RAM Transformer architecture and a comprehensive benchmark suite, and started the language model experiments.
Summary
This week saw 108 commits — a complete architecture sprint that transformed the project from a toy parity checker into a real research platform. The RAM Transformer, multiple attention mechanisms, benchmark suite, and the first language model experiments all landed.
RAM Transformer Architecture
Built a complete transformer-style architecture using RAM neurons:
- RAMTransformerBlock: Attention + FFN with XOR residual connections
- Multiple attention variants: SoftRAMAttention (learned), PositionOnlyAttention (computed, 100% generalization), ComputedSortingAttention, ComputedMinMaxAttention
- Cross-attention: RAMCrossAttention for encoder-decoder models
- FFN variants: Including computed operations (increment, ROT13, Caesar cipher) that achieve 100% generalization
The key distinction is between learned operations (limited to trained patterns) and computed operations (100% generalization via algorithmic implementation).
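To make the learned/computed distinction concrete, here is a minimal sketch. It is not the project's code: `LearnedIncrement`, `computed_increment`, `xor_residual`, and the 4-bit width are hypothetical names and choices for illustration, and the residual is assumed to be an elementwise XOR of the block input with the sublayer output. The learned operation is a lookup table that only answers for addresses it was trained on, while the computed operation implements the algorithm directly and so covers every input.

```python
from typing import Callable, Dict, List, Optional, Tuple

def to_bits(value: int, width: int) -> List[int]:
    """Big-endian bit list of `value` using `width` bits."""
    return [(value >> i) & 1 for i in reversed(range(width))]

def from_bits(bits: List[int]) -> int:
    """Inverse of to_bits."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

def computed_increment(bits: List[int]) -> List[int]:
    """Computed operation: add 1 modulo 2**width, valid for every input."""
    width = len(bits)
    return to_bits((from_bits(bits) + 1) % (1 << width), width)

class LearnedIncrement:
    """Learned operation (hypothetical): a lookup table over trained addresses."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, ...], List[int]] = {}

    def train(self, bits: List[int], target: List[int]) -> None:
        self.table[tuple(bits)] = list(target)

    def __call__(self, bits: List[int]) -> Optional[List[int]]:
        # Unseen addresses have no stored answer.
        return self.table.get(tuple(bits))

def xor_residual(x: List[int], fx: List[int]) -> List[int]:
    """Assumed residual form: elementwise XOR of input and sublayer output."""
    return [a ^ b for a, b in zip(x, fx)]

def accuracy(op: Callable[[List[int]], Optional[List[int]]], width: int) -> float:
    """Fraction of all 2**width inputs on which `op` matches the true increment."""
    inputs = [to_bits(v, width) for v in range(1 << width)]
    hits = sum(op(x) == computed_increment(x) for x in inputs)
    return hits / len(inputs)

WIDTH = 4
learned = LearnedIncrement()
for v in range(8):  # train on only half of the 16 possible inputs
    x = to_bits(v, WIDTH)
    learned.train(x, computed_increment(x))

learned_acc = accuracy(learned, WIDTH)              # covers only trained inputs
computed_acc = accuracy(computed_increment, WIDTH)  # covers all inputs
print(f"learned: {learned_acc:.2f}, computed: {computed_acc:.2f}")
```

One property worth noting about the assumed XOR residual: XOR is its own inverse, so applying the same sublayer output twice recovers the original input bits.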
Comprehensive Benchmarks
Tested across a wide range of tasks:
| Task | Accuracy | Notes |
|---|---|---|
| bAbI story understanding | 100% | Simple QA from stories |
| Theorem proving | 100% | Logical deduction |
| Code completion | 100% | Pattern-based |
| Sorting | 100% | Computed attention |
| Arithmetic | 100% | Computed FFN |
| SCAN/ListOps | Partial | Compositional generalization harder |
| Language modeling | 79% | First attempt, simple setup |
Language Model v2
Started the ram_lm_v2 benchmark — the first real attempt at WikiText-2 language modeling with RAM neurons. Key components:
- GPT-2 tokenizer (50,257 vocab)
- Cluster-based output (a cluster of neurons per token)
- Perplexity and cross-entropy scoring
- GA/TS connectivity optimization
Initial results were far from transformer-level but established the evaluation framework.
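As a rough illustration of what GA-based connectivity optimization can look like, here is a minimal sketch covering only the GA half. Everything in it is an assumption for illustration: the genome layout (one list of input-bit indices per neuron), the mutation scheme, the elitism setting, and especially `toy_fitness`, which stands in for the real objective (which would be something like negative cross-entropy on held-out text).

```python
import random
from typing import Callable, List, Tuple

Genome = List[List[int]]  # genome[n] = input-bit indices read by neuron n (assumed layout)

def random_genome(rng: random.Random, n_neurons: int,
                  bits_per_neuron: int, n_inputs: int) -> Genome:
    """Each neuron reads a random subset of the input bits."""
    return [rng.sample(range(n_inputs), bits_per_neuron) for _ in range(n_neurons)]

def mutate(genome: Genome, rng: random.Random, n_inputs: int,
           rate: float = 0.1) -> Genome:
    """Copy the genome and rewire each connection with probability `rate`."""
    child = [list(conn) for conn in genome]
    for conn in child:
        for i in range(len(conn)):
            if rng.random() < rate:
                conn[i] = rng.randrange(n_inputs)
    return child

def evolve(fitness: Callable[[Genome], float], rng: random.Random, n_inputs: int,
           n_neurons: int = 4, bits_per_neuron: int = 3, pop_size: int = 20,
           generations: int = 30, elite: int = 2) -> Tuple[Genome, List[float]]:
    """Mutation-only GA with elitism; returns the best genome and best-fitness history."""
    population = [random_genome(rng, n_neurons, bits_per_neuron, n_inputs)
                  for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [mutate(rng.choice(parents), rng, n_inputs)
                    for _ in range(pop_size - elite)]
        population = population[:elite] + children  # elites survive unchanged
    return max(population, key=fitness), history

# Toy stand-in objective: reward covering the "informative" input bits 0..7.
INFORMATIVE = set(range(8))

def toy_fitness(genome: Genome) -> float:
    covered = {i for conn in genome for i in conn}
    return float(len(covered & INFORMATIVE))

best, history = evolve(toy_fitness, random.Random(0), n_inputs=32)
print("best fitness per generation:", history)
```

Because the elites are carried over unchanged, the best fitness can never decrease from one generation to the next. The cost is dominated by the fitness calls, which is why evaluation speed matters so much for this kind of search.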
Rust+Metal Accelerator
The Python evaluation was too slow for population-based optimization (50 genomes, each evaluated on the full WikiText-2). Built a Rust accelerator with PyO3 bindings:
- rayon for CPU parallelism (16 cores)
- Metal compute shaders for GPU evaluation (40 cores)
- 822x speedup over pure Python for batch evaluation
This made overnight optimization runs feasible.
Other Notable Additions
- Kneser-Ney smoothing, BPE tokenizer support
- Contrastive learning and curriculum training
- Sparse memory backend for high-bit neurons
- Overfitting detection in evaluation
Evaluation Metrics
The standard metrics for language model evaluation, used throughout this research:
Cross-Entropy (CE): \(CE = -\frac{1}{N}\sum_{i=1}^{N} \log P(\text{token}_i \mid \text{context}_i)\) — measures average surprise per token in nats. Lower = better predictions. This is the fundamental objective: a model that assigns higher probability to the correct next token achieves lower CE.
Perplexity (PPL): \(PPL = e^{CE}\), the exponential of CE. Intuitively, it is the effective number of tokens the model is choosing among: a PPL of 100 means the model is as uncertain as if it were choosing uniformly among 100 tokens. PPL is the standard reporting metric in language modeling.
Accuracy (Acc): Top-1 next-token accuracy = fraction where \(\arg\max_t P(t \mid \text{context}) = \text{target}\). Note: accuracy is a coarser metric than CE/PPL — a model can have good CE (well-calibrated probabilities) with low accuracy (the correct token isn’t the top prediction but still gets reasonable probability).
These metrics derive from information theory (Shannon 1948). Perplexity was introduced as an evaluation measure for speech recognition language models by Jelinek, Mercer, Bahl & Baker (1977).
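The three definitions above translate almost directly into code. The sketch below is illustrative rather than the project's evaluation code: `evaluate` is a hypothetical helper that takes one probability distribution per position plus the target token ids. The two toy models show the point about accuracy being coarser than CE: model A is never top-1 correct but has the lower CE.

```python
import math
from typing import Sequence, Tuple

def evaluate(probs: Sequence[Sequence[float]],
             targets: Sequence[int]) -> Tuple[float, float, float]:
    """Return (cross-entropy in nats, perplexity, top-1 accuracy).

    probs[i][t] is the model's probability of token t at position i;
    targets[i] is the id of the correct token at position i.
    """
    n = len(targets)
    ce = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / n
    ppl = math.exp(ce)
    correct = sum(
        1 for p, t in zip(probs, targets)
        if max(range(len(p)), key=lambda i: p[i]) == t
    )
    return ce, ppl, correct / n

# Model A: target (token 1) is not the top prediction but still gets 0.45.
ce_a, ppl_a, acc_a = evaluate([[0.55, 0.45, 0.0]], [1])

# Model B: target (token 0) is the top prediction, but only barely.
ce_b, ppl_b, acc_b = evaluate([[0.34, 0.33, 0.33]], [0])

# Uniform over 100 tokens: perplexity equals the number of choices.
ce_u, ppl_u, acc_u = evaluate([[0.01] * 100], [42])

print(f"A: CE={ce_a:.3f} PPL={ppl_a:.2f} Acc={acc_a:.0%}")
print(f"B: CE={ce_b:.3f} PPL={ppl_b:.2f} Acc={acc_b:.0%}")
print(f"uniform-100: PPL={ppl_u:.2f}")
```

A real implementation would need to guard against zero probabilities on the target token (for example by clamping or smoothing), since `math.log(0)` raises an error.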
GPT-2 Baselines
GPT-2 (Radford et al. 2019) is our target benchmark — a family of transformer language models from OpenAI, evaluated zero-shot on WikiText-2 with the GPT-2 BPE tokenizer (50,257 vocab).
| Model | Params | PPL | CE (ln PPL) |
|---|---|---|---|
| GPT-2 Small | 124M | 29.41 | 3.38 |
| GPT-2 Medium | 355M | 22.76 | 3.13 |
| GPT-2 Large | 774M | 19.93 | 2.99 |
| GPT-2 XL | 1.5B | 18.34 | 2.91 |
These are zero-shot results: the models were not trained on WikiText-2. Accuracy is not reported in the original paper; PPL is the standard metric. These numbers represent the “goal” for our weightless neural network (WNN) architecture.
Random Baseline
A model that assigns uniform probability to all tokens: \(P(t) = \frac{1}{|V|}\) where \(|V| = 50{,}257\).
Derived from first principles:
- \(CE = -\frac{1}{N}\sum \log P(t_i) = -\log\frac{1}{|V|} = \ln(50{,}257) \approx 10.82\)
- \(PPL = e^{CE} = |V| = 50{,}257\)
- \(Acc = \frac{1}{|V|} \approx 0.002\%\)
This is the reference floor for a model that has learned nothing about the language: it assigns equal probability to every token. A model can still score worse than this by confidently putting its probability on wrong tokens, but any model that captures even basic patterns (e.g., common words are more likely) should beat it. In information theory, this corresponds to the maximum-entropy distribution over the vocabulary (Shannon 1948).
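The baseline numbers can be reproduced in a few lines. This is just the arithmetic from the derivation above, with the GPT-2 vocabulary size as the only input.

```python
import math

VOCAB_SIZE = 50_257  # GPT-2 BPE vocabulary size

uniform_p = 1.0 / VOCAB_SIZE        # P(t) for every token t
baseline_ce = -math.log(uniform_p)  # cross-entropy in nats, equals ln |V|
baseline_ppl = math.exp(baseline_ce)  # perplexity, equals |V|
baseline_acc = uniform_p            # chance of a uniform guess being correct

print(f"CE  = {baseline_ce:.2f} nats")
print(f"PPL = {baseline_ppl:.0f}")
print(f"Acc = {baseline_acc * 100:.4f}%")
```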
Next
With the infrastructure in place, the focus shifts to architecture search — finding the right neuron counts, bit widths, and connectivity patterns for language modeling.