Titans: Learning to Memorize at Test Time - Paper Notes

2025-02-14 20:03

Core Problem:

Existing models face a trade-off: Transformers capture dependencies accurately but their attention cost grows quadratically with sequence length, while linear recurrent models (e.g., Mamba) are efficient but compress history into a small fixed-size state, which limits long-term retention.

Solution - Titans Architecture:

Titans integrate three memory systems:

  1. Short-term Memory: Attention for local context.

  2. Long-term Memory (LMM): A neural module (an MLP) whose weights are updated at test time, so it learns what to memorize on the fly (see the update sketch after this list), using:

    • Surprise Metric: The gradient of the associative-memory loss with respect to the memory weights; more surprising (higher-gradient) inputs trigger larger updates.

    • Momentum: Carries past surprise forward, so tokens that follow a surprising event are still memorized even after the momentary gradient shrinks (avoiding getting stuck in flat regions).

    • Forgetting Mechanism: Data-dependent weight decay that discards outdated information when memory capacity is needed.

  3. Persistent Memory: Learnable but input-independent parameters, fixed at test time, that encode task-specific knowledge (e.g., grammar rules) and are prepended to the input.
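
The memory update is essentially a small online-learning loop run at inference. Below is a minimal sketch of that loop, assuming a 2-layer MLP memory and scalar gates (the paper makes the decay α, momentum η, and learning rate θ data-dependent); the class name, layer sizes, and hyperparameter values are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Deep memory M: a small MLP whose weights are rewritten at test time.
        self.memory = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )
        # Key/value/query projections (trained in the outer loop, frozen at test time).
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.W_Q = nn.Linear(dim, dim, bias=False)
        # Momentum state S, one tensor per memory parameter.
        self.S = [torch.zeros_like(p) for p in self.memory.parameters()]

    def update(self, x_t, alpha=0.01, eta=0.9, theta=0.1):
        # Surprise = gradient of the associative-memory loss ||M(k_t) - v_t||^2.
        k_t, v_t = self.W_K(x_t), self.W_V(x_t)
        loss = (self.memory(k_t) - v_t).pow(2).sum()
        grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, s, g in zip(self.memory.parameters(), self.S, grads):
                # S_t = eta * S_{t-1} - theta * grad_t   (past + momentary surprise)
                s.mul_(eta).add_(g, alpha=-theta)
                # M_t = (1 - alpha) * M_{t-1} + S_t      (forgetting via weight decay)
                p.mul_(1.0 - alpha).add_(s)

    def retrieve(self, x_t):
        # Reading the memory is just a forward pass; no weights are changed.
        return self.memory(self.W_Q(x_t))
```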

Architecture Variants:

  • MAC (Memory as Context): Prepends persistent-memory tokens and retrieved long-term memory to the current segment as extra context for attention. Excels in long-range tasks.

  • MAG (Memory as Gate): Runs sliding-window attention and the long-term memory branch in parallel and fuses them with a gate. Efficient for streaming data.

  • MAL (Memory as Layer): Stacks the memory module as a layer before attention. Faster but less accurate than MAC/MAG. (A minimal sketch of all three follows.)
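
For intuition on how the variants wire the branches together, here is a hedged sketch (shapes are [batch, tokens, dim]; `attention`, `sliding_attention`, and the sigmoid gate are stand-ins for the paper's actual blocks, and `memory` is a module like the NeuralMemory sketch above):

```python
import torch

def mac_forward(segment, memory, persistent, attention):
    # Memory as Context: persistent tokens and retrieved memory become
    # extra context for full attention over the current segment.
    retrieved = memory.retrieve(segment)
    y = attention(torch.cat([persistent, retrieved, segment], dim=1))
    y = y[:, -segment.size(1):]          # keep outputs for the segment positions
    memory.update(y)                     # write the attention output back to memory
    return y

def mag_forward(x, memory, persistent, sliding_attention):
    # Memory as Gate: sliding-window attention and the memory branch run in
    # parallel; an element-wise gate fuses them (placeholder gate here).
    attn_out = sliding_attention(torch.cat([persistent, x], dim=1))[:, persistent.size(1):]
    mem_out = memory.retrieve(x)
    gate = torch.sigmoid(attn_out)
    return gate * attn_out + (1.0 - gate) * mem_out

def mal_forward(x, memory, persistent, sliding_attention):
    # Memory as Layer: the memory module is simply a layer before attention.
    h = memory.retrieve(torch.cat([persistent, x], dim=1))
    return sliding_attention(h)
```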

Key Results:

  • Language Modeling: Titans reduce perplexity by 3-5% vs. Transformers/Mamba.

  • Needle-in-Haystack: 98.4% accuracy at 16K tokens, scaling to 2M+ tokens.

  • BABILong Benchmark: Outperforms GPT-4 and Llama3-70B in reasoning over long documents.

  • Time Series/DNA Modeling: State-of-the-art results (e.g., 0.162 MSE on ECL dataset).

Technical Innovations:

  • Parallel Training: Inner-loop memory updates are computed chunk-wise, as mini-batch gradient descent expressed with matrix multiplications, and the momentum recurrence is handled with a parallel associative scan (see the sketch below).

  • Deep Memory: An MLP memory (two or more layers) compresses history non-linearly and outperforms a linear, matrix-valued memory, especially at long context lengths.
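
Why the momentum term does not break parallelism: S_t = η_t·S_{t-1} + u_t is an affine recurrence, and composing affine maps is associative, so prefix states can be computed with a parallel (tree) scan. A tiny, generic illustration of that associativity (pure Python, toy numbers of my own; the paper's implementation fuses this with chunk-wise matmuls):

```python
from functools import reduce

def combine(a, b):
    # Each pair (eta, u) represents the affine map S -> eta * S + u;
    # composing two such maps is again affine, and the composition is
    # associative, which is exactly what a parallel scan requires.
    eta_a, u_a = a
    eta_b, u_b = b
    return (eta_b * eta_a, eta_b * u_a + u_b)

def momentum_sequential(pairs, S0=0.0):
    # Plain left-to-right unrolling of S_t = eta_t * S_{t-1} + u_t, for reference.
    states, S = [], S0
    for eta_t, u_t in pairs:
        S = eta_t * S + u_t
        states.append(S)
    return states

# The composed map applied to S0 matches the last sequentially computed state.
pairs = [(0.9, 0.1), (0.8, -0.2), (0.95, 0.05)]
eta_tot, u_tot = reduce(combine, pairs)
assert abs((eta_tot * 0.0 + u_tot) - momentum_sequential(pairs)[-1]) < 1e-12
```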

Limitations & Future Work:

  • The deep memory module adds compute overhead, and the data-dependent gates (decay, momentum, learning rate) introduce hyperparameters that still need tuning.

  • Potential extensions: sparse memory structures, multimodal applications, hardware optimizations.

Conclusion:

Titans effectively balance efficiency and accuracy for long sequences by merging attention with an adaptive, test-time-trained long-term memory. The architecture’s flexibility (via MAC/MAG/MAL) and its theoretical expressiveness advantages over fixed-size linear memories position it as a robust framework for tasks requiring extensive context, from genomics to language modeling.