Week 4 – RAM Transformer and the Big Sprint

Author

Luiz Garcia

Published

December 29, 2025

Abstract

The most productive week yet. Built the complete RAM Transformer architecture, comprehensive benchmarks, and started the language model experiments.

Summary

This week saw 108 commits — a complete architecture sprint that transformed the project from a toy parity checker into a real research platform. The RAM Transformer, multiple attention mechanisms, benchmark suite, and the first language model experiments all landed.

RAM Transformer Architecture

Built a complete transformer-style architecture using RAM neurons:

  • RAMTransformerBlock: Attention + FFN with XOR residual connections
  • Multiple attention variants: SoftRAMAttention (learned), PositionOnlyAttention (computed, 100% generalization), ComputedSortingAttention, ComputedMinMaxAttention
  • Cross-attention: RAMCrossAttention for encoder-decoder models
  • FFN variants: Including computed operations (increment, ROT13, Caesar cipher) that achieve 100% generalization

The key distinction is between learned operations (limited to trained patterns) and computed operations (100% generalization via algorithmic implementation).
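The distinction can be made concrete with a minimal sketch (illustrative names, not the project's actual API): a learned operation is a lookup table that only answers for trained patterns, while a computed operation implements the algorithm directly and handles any input.

```python
# Sketch of learned vs. computed operations (hypothetical helpers,
# not the actual RAM Transformer code).

def learned_op(table, pattern):
    """RAM-style learned lookup: only patterns seen in training get answers."""
    return table.get(pattern)  # None for unseen patterns

def computed_increment(bits):
    """Algorithmic (computed) increment with wraparound: works on any input."""
    n = int("".join(map(str, bits)), 2)
    n = (n + 1) % (1 << len(bits))
    return [int(b) for b in format(n, f"0{len(bits)}b")]

# "Train" the lookup on two patterns only
table = {(0, 0): (0, 1), (0, 1): (1, 0)}
learned_op(table, (1, 1))    # unseen pattern -> None (no generalization)
computed_increment([1, 1])   # -> [0, 0] (wraps; generalizes to all inputs)
```

The same logic applies to the ROT13 and Caesar-cipher FFN variants: because the transformation is computed rather than memorized, test accuracy does not depend on training coverage.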

Comprehensive Benchmarks

Tested across a wide range of tasks:

Task                      Accuracy  Notes
bAbI story understanding  100%      Simple QA from stories
Theorem proving           100%      Logical deduction
Code completion           100%      Pattern-based
Sorting                   100%      Computed attention
Arithmetic                100%      Computed FFN
SCAN/ListOps              Partial   Compositional generalization harder
Language modeling         79%       First attempt, simple setup

Language Model v2

Started the ram_lm_v2 benchmark — the first real attempt at WikiText-2 language modeling with RAM neurons. Key components:

  • GPT-2 tokenizer (50,257 vocab)
  • Cluster-based output (neurons per token)
  • Perplexity and cross-entropy scoring
  • GA/TS connectivity optimization

Initial results were far from transformer-level but established the evaluation framework.
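A minimal sketch of the cluster-based output idea (hypothetical function names; cluster size and scoring are assumptions for illustration): each token owns a small cluster of binary RAM-neuron outputs, the cluster's vote count becomes a raw score, and a softmax over scores yields the distribution that perplexity is computed against.

```python
import math

def token_scores(neuron_outputs, cluster_size):
    """Sum each token's cluster of 0/1 neuron outputs into a raw score."""
    return [sum(neuron_outputs[i:i + cluster_size])
            for i in range(0, len(neuron_outputs), cluster_size)]

def softmax(scores):
    """Numerically stable softmax over raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy setup: 3 tokens x 4 neurons each
outputs = [1, 1, 1, 0,   0, 1, 0, 0,   0, 0, 0, 0]
probs = softmax(token_scores(outputs, 4))  # token 0 gets the most votes
ce = -math.log(probs[0])  # cross-entropy if token 0 is the true next token
```

With 50,257 tokens this layout determines the total neuron budget directly, which is why cluster size becomes a central architecture-search parameter later.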

Rust+Metal Accelerator

The Python evaluation was too slow for population-based optimization (50 genomes × a full WikiText-2 pass each). Built a Rust accelerator with PyO3 bindings:

  • rayon for CPU parallelism (16 cores)
  • Metal compute shaders for GPU evaluation (40 cores)
  • 822x speedup over pure Python for batch evaluation

This made overnight optimization runs feasible.
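From the Python side, the accelerator's job is simple: score a whole population in one batched call instead of a serial loop. A hypothetical stand-in (not the actual PyO3 binding; the placeholder fitness and pool-based parallelism are illustrative only — the real speedup comes from rayon and Metal, not Python threads):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_genome(genome):
    """Placeholder fitness: a cheap function of the genome bits.
    In the real system this is a full WikiText-2 cross-entropy pass."""
    return sum(genome) / len(genome)

def evaluate_population(genomes, workers=4):
    """Score every genome; the Rust accelerator replaces this whole loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_genome, genomes))

fitnesses = evaluate_population([[0, 1], [1, 1], [0, 0]])
```

Moving this inner loop to Rust is what turned a multi-day GA generation into something that fits in an overnight run.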

Other Notable Additions

  • Kneser-Ney smoothing, BPE tokenizer support
  • Contrastive learning and curriculum training
  • Sparse memory backend for high-bit neurons
  • Overfitting detection in evaluation

Evaluation Metrics

The standard metrics for language model evaluation, used throughout this research:

  • Cross-Entropy (CE): \(CE = -\frac{1}{N}\sum_{i=1}^{N} \log P(\text{token}_i \mid \text{context}_i)\) — measures average surprise per token in nats. Lower = better predictions. This is the fundamental objective: a model that assigns higher probability to the correct next token achieves lower CE.

  • Perplexity (PPL): \(PPL = e^{CE}\) — the exponential of CE. Intuitively, the “effective vocabulary size” the model is uncertain over. A PPL of 100 means the model is as confused as if choosing uniformly among 100 tokens. PPL is the standard reporting metric in language modeling.

  • Accuracy (Acc): Top-1 next-token accuracy = fraction where \(\arg\max_t P(t \mid \text{context}) = \text{target}\). Note: accuracy is a coarser metric than CE/PPL — a model can have good CE (well-calibrated probabilities) with low accuracy (the correct token isn’t the top prediction but still gets reasonable probability).

These metrics derive from information theory (Shannon 1948). Perplexity was introduced as a language model evaluation metric by Jelinek & Mercer (1980).
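The three metrics can be computed directly from per-token target probabilities. A worked example over a tiny "corpus" of N = 3 predictions (the probabilities are made up for illustration):

```python
import math

probs_of_target = [0.25, 0.10, 0.50]  # P(correct token | context) per step
top1_correct = [True, False, True]    # was the target the argmax prediction?

ce = -sum(math.log(p) for p in probs_of_target) / len(probs_of_target)
ppl = math.exp(ce)
acc = sum(top1_correct) / len(top1_correct)

print(f"CE = {ce:.3f} nats, PPL = {ppl:.2f}, Acc = {acc:.0%}")
# → CE = 1.461 nats, PPL = 4.31, Acc = 67%
```

Note that PPL is the reciprocal of the geometric mean of the target probabilities, which is why a single near-zero probability inflates it dramatically.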

GPT-2 Baselines

GPT-2 (Radford et al. 2019) is our target benchmark — a family of transformer language models from OpenAI, evaluated zero-shot on WikiText-2 with the GPT-2 BPE tokenizer (50,257 vocab).

Model         Params  PPL    CE (ln PPL)
GPT-2 Small   124M    29.41  3.38
GPT-2 Medium  355M    22.76  3.12
GPT-2 Large   774M    19.93  2.99
GPT-2 XL      1.5B    18.34  2.91

These are zero-shot results — the model was NOT trained on WikiText-2. Accuracy is not reported in the original paper; PPL is the standard metric. These numbers represent the “goal” for our WNN architecture.

Random Baseline

A model that assigns uniform probability to all tokens: \(P(t) = \frac{1}{|V|}\) where \(|V| = 50{,}257\).

Derived from first principles:

  • \(CE = -\frac{1}{N}\sum \log P(t_i) = -\log\frac{1}{|V|} = \ln(50{,}257) \approx 10.82\)
  • \(PPL = e^{CE} = |V| = 50{,}257\)
  • \(Acc = \frac{1}{|V|} \approx 0.002\%\)

This is the worst case for a non-degenerate model — it has learned nothing about the language and assigns equal probability to every token. Any model that captures even basic patterns (e.g., common words are more likely) should beat this. In information theory, this corresponds to maximum entropy over the vocabulary (Shannon 1948).
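The derivation above is easy to verify numerically:

```python
import math

# Uniform-baseline metrics for |V| = 50,257 (GPT-2 BPE vocabulary).
V = 50_257
ce = -math.log(1 / V)  # = ln(V)
ppl = math.exp(ce)     # = V
acc = 1 / V

print(f"CE = {ce:.2f} nats")  # → CE = 10.82 nats
print(f"PPL = {ppl:.0f}")     # → PPL = 50257
print(f"Acc = {acc:.6%}")     # → Acc = 0.001990%
```

Any RAM model whose perplexity lands below 50,257 has therefore learned at least some structure of the token distribution.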

Next

With the infrastructure in place, the focus shifts to architecture search — finding the right neuron counts, bit widths, and connectivity patterns for language modeling.

References

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.

Reuse

CC-BY-NC-SA-4.0