
Thrummarise
@summarizer
Supervised Fine-Tuning (SFT) is a popular way to adapt large language models (LLMs) with expert demonstrations, but it generalizes less well than Reinforcement Learning (RL). RL optimizes explicit rewards, which enables broader exploration of strategies, yet it is computationally costly and complex to set up.
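
As a reference point for the analysis below, the standard SFT objective is the token-level negative log-likelihood of expert demonstrations (generic notation, not copied from the paper):

```latex
% Standard SFT objective: negative log-likelihood of expert responses y^*
% given prompts x, summed over response tokens (\pi_\theta is the model).
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^*) \sim \mathcal{D}}
      \sum_{t} \log \pi_\theta\!\left(y^*_t \mid x,\, y^*_{<t}\right)
```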

This research reveals mathematically that SFT gradients implicitly encode a problematic reward structure: the reward is sparse and inversely proportional to the model's confidence in expert actions. This causes unstable gradients with high variance, leading to overfitting and poor generalization.
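
The identity behind this claim can be sketched for a single token with the notation above (a simplified version; the paper's derivation handles full sequences):

```latex
% Rewriting the SFT gradient as an on-policy policy gradient: weighting the
% expert token y^* by 1/\pi_\theta turns the supervised gradient into an
% expectation over the model's own samples.
-\nabla_\theta \log \pi_\theta(y^* \mid x)
  = -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \left[\frac{\mathbf{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
            \nabla_\theta \log \pi_\theta(y \mid x)\right]
% The bracketed weight acts as an implicit reward: it is sparse (nonzero only
% on the expert token) and scales as 1/\pi_\theta(y^* \mid x), so it blows up
% exactly when the model is least confident, producing high-variance updates.
```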

To fix this, the authors propose Dynamic Fine-Tuning (DFT), which rescales the SFT loss by multiplying each token's loss by the model's predicted probability of that token. This simple one-line change cancels the inverse-probability weighting, stabilizing gradient updates and turning the implicit reward into a uniform signal.
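
A minimal PyTorch-style sketch of that rescaling, assuming a standard next-token-prediction setup; the function and variable names are illustrative, not taken from the authors' released code:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    """Token-level cross-entropy rescaled by the detached predicted probability
    of each target token; dropping the `p *` factor recovers plain SFT."""
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), padding = -100
    logits = logits[:, :-1, :]            # position t predicts token t+1
    labels = labels[:, 1:]
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    )                                      # per-token -log p(y_t), 0 on padding
    p = torch.exp(-ce).detach()            # predicted prob of the target, no grad
    mask = (labels.reshape(-1) != ignore_index).float()
    return (p * ce * mask).sum() / mask.sum()
```

Detaching the probability is the key design choice: it rescales the gradient (cancelling the implicit 1/π factor) without opening a second gradient path through the weight itself.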

DFT was tested on several strong base models, including Qwen2.5-Math and LLaMA variants, fine-tuned on large mathematical reasoning datasets and evaluated on benchmarks such as OlympiadBench and AMC. Results show DFT outperforms standard SFT by large margins, especially on the harder benchmarks, where SFT sometimes even degrades performance.

DFT also converges faster and requires fewer training steps than SFT, indicating more informative gradient updates. It avoids noisy optimization plateaus by stabilizing the reward signal, enabling the model to learn complex reasoning patterns more efficiently.

Analysis of token probability distributions shows that DFT polarizes token confidence: it boosts the probabilities of some tokens while suppressing others, unlike SFT, which raises confidence across the board. This selective fitting resembles human learning, concentrating on key tokens rather than treating every token equally.
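
This kind of analysis can be reproduced with a short script that compares the per-token probabilities two checkpoints assign to the same expert responses (a hypothetical sketch using Hugging Face transformers, not the authors' analysis code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_probs(model_name, prompt, response):
    """Probabilities a model assigns to each token of an expert response.
    Assumes prompt and prompt+response tokenize consistently at the boundary."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits[:, :-1, :]   # position t predicts t+1
    targets = ids[:, 1:]
    probs = torch.softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    return probs[0, n_prompt - 1:]   # keep only the response tokens

# Plotting histograms of token_probs(...) for an SFT checkpoint vs. a DFT
# checkpoint should show the polarization described above: DFT pushes some
# tokens toward 1 and suppresses others, while SFT lifts the whole distribution.
```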

Ablation studies confirm DFT's performance gains are not due to hyperparameter tuning alone. It consistently outperforms SFT across learning rates and batch sizes, demonstrating robustness and ease of integration into existing training workflows.

In conclusion, this work bridges theory and practice by proving that SFT acts as a policy gradient method with an ill-posed, inverse-probability reward, and by introducing DFT to rectify it. DFT is a simple one-line fix that substantially improves the generalization and efficiency of LLM fine-tuning.

Limitations include evaluation mainly on mathematical reasoning tasks and on models up to 7B parameters; future work aims to extend DFT to other domains, larger models, and multimodal tasks to validate its broader applicability.