8/4/2025

Length Generalization Transfer in Transformers via Task Association

10 tweets
2 min read

Thrummarise

@summarizer

Transformer language models excel at generalizing beyond their training data, but how they extrapolate to longer inputs remains unclear. This research introduces length generalization transfer: a model's ability, when trained jointly on related tasks, to extend length generalization from one task to another.

The study explores length generalization transfer across three algorithmic domains: arithmetic (e.g., reverse addition), string manipulation (e.g., case flipping and reversal), and maze navigation (e.g., shortest path and DFS trace). Training a model on a main task with short inputs alongside an auxiliary task with longer inputs enables the main task to generalize to longer sequences.
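To make that joint training setup concrete, here is a minimal sketch of the data mixture it describes: a main task sampled at short lengths mixed with an auxiliary task sampled at longer lengths. The sampler functions, length caps, and mixing ratio are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the co-training data mixture (illustrative; the paper's exact
# sampling scheme, lengths, and mixing ratio may differ).
import random

def co_training_batch(sample_main, sample_aux,
                      main_max_len=10, aux_max_len=40,
                      batch_size=32, aux_fraction=0.5):
    """Build one batch mixing short main-task examples with longer
    auxiliary-task examples; each sampler maps a length to a text example."""
    batch = []
    for _ in range(batch_size):
        if random.random() < aux_fraction:
            batch.append(sample_aux(random.randint(1, aux_max_len)))
        else:
            batch.append(sample_main(random.randint(1, main_max_len)))
    return batch
```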

For example, in arithmetic, a model trained on reverse addition with short inputs fails to generalize to longer inputs on its own. But co-training with related tasks at longer lengths, such as reverse subtraction or carry detection, allows it to extrapolate to longer inputs, demonstrating transfer of length generalization capabilities.
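As a rough illustration of what these arithmetic tasks look like, the sketch below generates reverse-addition (main task) and carry-detection (auxiliary task) examples; the input/output format is an assumption, not taken from the paper.

```python
# Hypothetical generators for the arithmetic tasks above; the digit-reversed
# format writes numbers least-significant digit first.
import random

def reverse_addition(n_digits):
    """Main task: a + b with operands and sum in reversed digit order."""
    a = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    b = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    return f"{str(a)[::-1]}+{str(b)[::-1]}={str(a + b)[::-1]}"

def carry_detection(n_digits):
    """Auxiliary task: mark, per digit position, whether adding a and b
    produces a carry out of that position."""
    a = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    b = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    da, db = str(a)[::-1], str(b)[::-1]
    carries, carry = [], 0
    for x, y in zip(da, db):
        carry = 1 if int(x) + int(y) + carry >= 10 else 0
        carries.append(str(carry))
    return f"{da}+{db}->{''.join(carries)}"
```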

Similarly, in string tasks, combining a main task like string copy with auxiliary tasks such as multi-query associative recall or case inversion leads to improved length extrapolation. In maze tasks, co-training shortest path and DFS trace tasks at different lengths enables mutual transfer of length generalization.
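For a sense of the string tasks, here is an assumed minimal format for string copy and case inversion; the paper's exact prompts and separators may differ.

```python
# Illustrative string-task generators (formats are assumptions).
import random
import string

def string_copy(length):
    """Main task: reproduce the input string verbatim."""
    s = "".join(random.choices(string.ascii_lowercase, k=length))
    return f"{s} copy {s}"

def case_inversion(length):
    """Auxiliary task: flip upper/lower case of every character."""
    s = "".join(random.choices(string.ascii_letters, k=length))
    return f"{s} flip {s.swapcase()}"
```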

Control experiments confirm that unrelated auxiliary tasks do not induce length generalization transfer, highlighting the importance of task relatedness. This suggests that transformers reuse computational structures when tasks share underlying algorithmic procedures.

Remarkably, pretrained language models also exhibit length generalization transfer. Fine-tuning checkpoints from natural-language pretraining on synthetic tasks yields improved extrapolation to longer inputs, indicating that pretraining builds reusable computational scaffolds that aid downstream generalization.

Mechanistic analysis reveals that successful length generalization transfer correlates with the reuse of attention heads across tasks. Attention matrix similarity and head importance align with improved transfer, suggesting shared internal circuits enable extrapolation by association.
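One simple way to quantify the attention-head reuse described here is to compare a head's attention maps on matched inputs from the two tasks. The per-head cosine-similarity metric below is a generic sketch, not necessarily the paper's exact analysis.

```python
# Generic sketch of a cross-task attention-reuse metric (assumed, not the
# paper's exact procedure): cosine similarity of each head's attention map
# computed on same-length inputs from the two tasks.
import torch
import torch.nn.functional as F

def per_head_attention_similarity(attn_a: torch.Tensor,
                                  attn_b: torch.Tensor) -> torch.Tensor:
    """attn_a, attn_b: (n_heads, seq_len, seq_len) attention matrices from the
    same layer on task A and task B; returns one similarity score per head."""
    a = attn_a.flatten(start_dim=1)  # (n_heads, seq_len * seq_len)
    b = attn_b.flatten(start_dim=1)
    return F.cosine_similarity(a, b, dim=1)
```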

The study also finds that rotary positional encoding (RoPE) enhances length generalization transfer compared to no positional encoding, supporting its use in modern transformers for better long-range generalization.
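For reference, RoPE rotates each pair of query/key channels by a position-dependent angle, so attention scores depend on relative offsets. Below is a compact sketch of the standard rotate-half formulation; the base frequency and tensor shapes are illustrative, not tied to the paper's implementation.

```python
# Compact RoPE sketch (standard rotate-half formulation; hyperparameters are
# illustrative).
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to queries or keys.
    x: (batch, seq_len, n_heads, head_dim) with even head_dim."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One frequency per channel pair, decaying geometrically with channel index.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```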

However, training dynamics are unstable, with transfer performance varying significantly across random seeds. The transfer effect is strongest when the auxiliary task length is within a factor of two of the main task length, indicating a sweet spot for effective knowledge sharing.

While promising, this work focuses on synthetic algorithmic tasks with clear length definitions. Extending these findings to more complex, real-world tasks involving hierarchical reasoning or multi-skill integration remains an open challenge for future research.
