Titans: Learning to Memorize at Test Time - Paper Notes

2025-02-14 20:03

Core Problem:

Existing models face a trade-off: Transformers capture dependencies accurately but their attention cost grows quadratically with sequence length, while linear recurrent models (e.g., Mamba) are efficient but compress history into a small fixed-size state, which limits long-term retention.

Solution - Titans Architecture:

Titans integrate three memory systems:

  1. Short-term Memory: Attention for local context.

  2. Long-term Memory (LMM): A neural module (an MLP) whose weights are updated at test time, so it learns what to memorize on the fly (see the update sketch after this list), using:

    • Surprise Metric: The gradient of the associative-memory loss with respect to the memory weights; more surprising (higher-gradient) inputs trigger larger updates.

    • Momentum: Carries past surprise forward, so tokens that follow a surprising event are still memorized even after the momentary gradient shrinks (avoiding getting stuck in flat regions).

    • Forgetting Mechanism: Data-dependent weight decay that discards outdated information when memory capacity is needed.

  3. Persistent Memory: Learnable but input-independent parameters, fixed at test time, that encode task-specific knowledge (e.g., grammar rules) and are prepended to the input.
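
The memory update is essentially a small online-learning loop run at inference. Below is a minimal sketch of that loop, assuming a 2-layer MLP memory and scalar gates (the paper makes the decay α, momentum η, and learning rate θ data-dependent); the class name, layer sizes, and hyperparameter values are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Deep memory M: a small MLP whose weights are rewritten at test time.
        self.memory = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )
        # Key/value/query projections (trained in the outer loop, frozen at test time).
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.W_Q = nn.Linear(dim, dim, bias=False)
        # Momentum state S, one tensor per memory parameter.
        self.S = [torch.zeros_like(p) for p in self.memory.parameters()]

    def update(self, x_t, alpha=0.01, eta=0.9, theta=0.1):
        # Surprise = gradient of the associative-memory loss ||M(k_t) - v_t||^2.
        k_t, v_t = self.W_K(x_t), self.W_V(x_t)
        loss = (self.memory(k_t) - v_t).pow(2).sum()
        grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, s, g in zip(self.memory.parameters(), self.S, grads):
                # S_t = eta * S_{t-1} - theta * grad_t   (past + momentary surprise)
                s.mul_(eta).add_(g, alpha=-theta)
                # M_t = (1 - alpha) * M_{t-1} + S_t      (forgetting via weight decay)
                p.mul_(1.0 - alpha).add_(s)

    def retrieve(self, x_t):
        # Reading the memory is just a forward pass; no weights are changed.
        return self.memory(self.W_Q(x_t))
```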

Architecture Variants:

  • MAC (Memory as Context): Prepends persistent-memory tokens and retrieved long-term memory to the current segment as extra context for attention. Excels in long-range tasks.

  • MAG (Memory as Gate): Runs sliding-window attention and the long-term memory branch in parallel and fuses them with a gate. Efficient for streaming data.

  • MAL (Memory as Layer): Stacks the memory module as a layer before attention. Faster but less accurate than MAC/MAG. (A minimal sketch of all three follows.)
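
For intuition on how the variants wire the branches together, here is a hedged sketch (shapes are [batch, tokens, dim]; `attention`, `sliding_attention`, and the sigmoid gate are stand-ins for the paper's actual blocks, and `memory` is a module like the NeuralMemory sketch above):

```python
import torch

def mac_forward(segment, memory, persistent, attention):
    # Memory as Context: persistent tokens and retrieved memory become
    # extra context for full attention over the current segment.
    retrieved = memory.retrieve(segment)
    y = attention(torch.cat([persistent, retrieved, segment], dim=1))
    y = y[:, -segment.size(1):]          # keep outputs for the segment positions
    memory.update(y)                     # write the attention output back to memory
    return y

def mag_forward(x, memory, persistent, sliding_attention):
    # Memory as Gate: sliding-window attention and the memory branch run in
    # parallel; an element-wise gate fuses them (placeholder gate here).
    attn_out = sliding_attention(torch.cat([persistent, x], dim=1))[:, persistent.size(1):]
    mem_out = memory.retrieve(x)
    gate = torch.sigmoid(attn_out)
    return gate * attn_out + (1.0 - gate) * mem_out

def mal_forward(x, memory, persistent, sliding_attention):
    # Memory as Layer: the memory module is simply a layer before attention.
    h = memory.retrieve(torch.cat([persistent, x], dim=1))
    return sliding_attention(h)
```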

Key Results:

  • Language Modeling: Titans reduce perplexity by 3-5% vs. Transformers/Mamba.

  • Needle-in-Haystack: 98.4% accuracy at 16K tokens, scaling to 2M+ tokens.

  • BABILong Benchmark: Outperforms GPT-4 and Llama3-70B in reasoning over long documents.

  • Time Series/DNA Modeling: State-of-the-art results (e.g., 0.162 MSE on ECL dataset).

Technical Innovations:

  • Parallel Training: Inner-loop memory updates are computed chunk-wise, as mini-batch gradient descent expressed with matrix multiplications, and the momentum recurrence is handled with a parallel associative scan (see the sketch below).

  • Deep Memory: An MLP memory (two or more layers) compresses history non-linearly and outperforms a linear, matrix-valued memory, especially at long context lengths.
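
Why the momentum term does not break parallelism: S_t = η_t·S_{t-1} + u_t is an affine recurrence, and composing affine maps is associative, so prefix states can be computed with a parallel (tree) scan. A tiny, generic illustration of that associativity (pure Python, toy numbers of my own; the paper's implementation fuses this with chunk-wise matmuls):

```python
from functools import reduce

def combine(a, b):
    # Each pair (eta, u) represents the affine map S -> eta * S + u;
    # composing two such maps is again affine, and the composition is
    # associative, which is exactly what a parallel scan requires.
    eta_a, u_a = a
    eta_b, u_b = b
    return (eta_b * eta_a, eta_b * u_a + u_b)

def momentum_sequential(pairs, S0=0.0):
    # Plain left-to-right unrolling of S_t = eta_t * S_{t-1} + u_t, for reference.
    states, S = [], S0
    for eta_t, u_t in pairs:
        S = eta_t * S + u_t
        states.append(S)
    return states

# The composed map applied to S0 matches the last sequentially computed state.
pairs = [(0.9, 0.1), (0.8, -0.2), (0.95, 0.05)]
eta_tot, u_tot = reduce(combine, pairs)
assert abs((eta_tot * 0.0 + u_tot) - momentum_sequential(pairs)[-1]) < 1e-12
```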

Limitations & Future Work:

  • The deep memory module adds compute overhead, and the data-dependent gates (decay, momentum, learning rate) introduce hyperparameters that still need tuning.

  • Potential extensions: sparse memory structures, multimodal applications, hardware optimizations.

Conclusion:

Titans effectively balance efficiency and accuracy for long sequences by merging attention with an adaptive, test-time-trained long-term memory. The architecture’s flexibility (via MAC/MAG/MAL) and its theoretical expressiveness advantages over fixed-size linear memories position it as a robust framework for tasks requiring extensive context, from genomics to language modeling.