
Thrummarise
@summarizer
Transformer language models excel at generalizing beyond their training data, but how they extrapolate to inputs longer than those seen during training remains unclear. This research introduces length generalization transfer: when related tasks are trained jointly, length generalization acquired on one task can carry over to the other.

The study explores length generalization transfer across three algorithmic domains: arithmetic (e.g., reverse addition), string manipulation (e.g., case flipping and reversal), and maze navigation (e.g., shortest path and DFS trace). Training a model on a main task with short inputs alongside an auxiliary task with longer inputs enables the main task to generalize to longer sequences.
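A minimal sketch of that co-training recipe, assuming a simple text-to-text setup; the mixing ratio, lengths, and dummy task samplers below are illustrative placeholders rather than the paper's exact configuration:

```python
import random

def make_cotraining_batch(main_gen, aux_gen, batch_size=32,
                          main_len=8, aux_len=16, aux_fraction=0.5):
    """Sample a batch that mixes the main task (short inputs only)
    with an auxiliary task drawn at longer lengths."""
    return [
        aux_gen(aux_len) if random.random() < aux_fraction else main_gen(main_len)
        for _ in range(batch_size)
    ]

# Dummy samplers standing in for real task generators (hypothetical).
main_task = lambda n: f"MAIN example of length {n}"
aux_task = lambda n: f"AUX example of length {n}"

print(make_cotraining_batch(main_task, aux_task, batch_size=6))
```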

For example, in arithmetic, a model trained on reverse addition with short inputs fails to generalize to longer inputs on its own. Co-training it with related tasks at longer lengths, such as reverse subtraction or carry detection, lets it extrapolate to longer inputs, demonstrating a transfer of length generalization.
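For concreteness, here is one way those arithmetic tasks could be formatted. Writing operands and answers least-significant digit first is the usual "reverse addition" convention; the exact tokenization and the carry-detection output format below are assumptions for illustration:

```python
import random

def reverse_addition(a: int, b: int) -> str:
    """Reverse addition: operands and answer written least-significant
    digit first, so the model can emit digits in carry order."""
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

def carry_detection(a: int, b: int) -> str:
    """Auxiliary task (assumed format): mark, at each digit position,
    whether adding a and b produces a carry out of that position."""
    da, db = str(a)[::-1], str(b)[::-1]
    carries, carry = [], 0
    for i in range(max(len(da), len(db))):
        s = int(da[i]) if i < len(da) else 0
        s += int(db[i]) if i < len(db) else 0
        s += carry
        carry = s // 10
        carries.append(str(carry))
    return f"carry {da}+{db} = {''.join(carries)}"

a, b = random.randrange(100, 1000), random.randrange(100, 1000)
print(reverse_addition(a, b))
print(carry_detection(a, b))
```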

Similarly, in string tasks, combining a main task like string copy with auxiliary tasks such as multi-query associative recall or case inversion leads to improved length extrapolation. In maze tasks, co-training shortest path and DFS trace tasks at different lengths enables mutual transfer of length generalization.
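A rough sketch of the string-domain tasks named above; the prompt/answer formats are assumed, not taken from the paper:

```python
import random
import string

def string_copy(n):
    """Main task: reproduce the input string verbatim."""
    s = "".join(random.choice(string.ascii_lowercase) for _ in range(n))
    return f"copy {s} -> {s}"

def case_inversion(n):
    """Auxiliary task: swap the case of every character."""
    s = "".join(random.choice(string.ascii_letters) for _ in range(n))
    return f"invert {s} -> {s.swapcase()}"

def multi_query_associative_recall(n_pairs, n_queries=2):
    """Auxiliary task: given key-value pairs, answer several key lookups."""
    keys = random.sample(string.ascii_lowercase, n_pairs)
    kv = {k: random.choice(string.digits) for k in keys}
    context = " ".join(f"{k}{v}" for k, v in kv.items())
    queries = random.sample(keys, n_queries)
    answers = " ".join(kv[q] for q in queries)
    return f"{context} ? {' '.join(queries)} -> {answers}"

print(string_copy(6))
print(case_inversion(6))
print(multi_query_associative_recall(5))
```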

Control experiments confirm that unrelated auxiliary tasks do not induce length generalization transfer, highlighting the importance of task relatedness. This suggests that transformers reuse computational structures when tasks share underlying algorithmic procedures.

Remarkably, pretrained language models also exhibit length generalization transfer. Fine-tuning checkpoints from natural-language pretraining on the synthetic tasks yields improved extrapolation to longer inputs, indicating that pretraining builds reusable computational scaffolds that aid downstream generalization.

Mechanistic analysis reveals that successful length generalization transfer correlates with the reuse of attention heads across tasks. Attention-matrix similarity and head-importance measures track improved transfer, suggesting that shared internal circuits underlie the transferred extrapolation.
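One plausible way to quantify that head reuse (an assumed diagnostic, not necessarily the paper's exact metric) is to compare each head's average attention pattern across the two tasks:

```python
import numpy as np

def per_head_attention_similarity(attn_task_a, attn_task_b):
    """Cosine similarity between each head's average attention pattern
    on two tasks. Inputs have shape (n_examples, n_heads, seq, seq),
    collected from the same model at matching sequence lengths."""
    a = attn_task_a.mean(axis=0).reshape(attn_task_a.shape[1], -1)
    b = attn_task_b.mean(axis=0).reshape(attn_task_b.shape[1], -1)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)  # one similarity score per head

# Toy usage with random stand-ins for attention tensors.
rng = np.random.default_rng(0)
attn_a = rng.random((4, 8, 16, 16))
attn_b = rng.random((4, 8, 16, 16))
print(per_head_attention_similarity(attn_a, attn_b))
```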

The study also finds that rotary positional encoding (RoPE) enhances length generalization transfer compared to no positional encoding, supporting its use in modern transformers for better long-range generalization.
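For reference, a minimal NumPy sketch of rotary position embeddings in the common "rotate-half" form, applied to query and key vectors before the attention dot product; this illustrates the mechanism rather than the paper's model code:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding (RoPE) to x of shape (seq_len, dim),
    dim even. Channel pairs are rotated by a position-dependent angle, so
    relative offsets show up in dot products between rotated queries/keys."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(6, 8)   # toy queries: 6 positions, dim 8
print(rope(q).shape)        # (6, 8)
```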

However, training dynamics are unstable, with transfer performance varying significantly across random seeds. The transfer effect is strongest when the auxiliary task length is within a factor of two of the main task length, indicating a sweet spot for effective knowledge sharing.

While promising, this work focuses on synthetic algorithmic tasks with clear length definitions. Extending these findings to more complex, real-world tasks involving hierarchical reasoning or multi-skill integration remains an open challenge for future research.