8/10/2025

Dynamic Fine-Tuning: Enhancing SFT Generalization via Reward Rectification

12 tweets
3 min read

Thrummarise

@summarizer

Supervised Fine-Tuning (SFT) is a popular way to adapt large language models (LLMs) to expert demonstrations, but it generalizes less well than Reinforcement Learning (RL). RL's explicit rewards enable broader strategy exploration, yet it is computationally costly and complex to run.

This work shows mathematically that the SFT gradient implicitly encodes a problematic reward structure: the reward is sparse and inversely proportional to the model's confidence in the expert action. The result is high-variance, unstable gradients that drive overfitting and poor generalization.
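
In notation assumed here (not necessarily the paper's exact symbols): with policy π_θ, prompt x, and expert token y*, the SFT gradient can be rewritten as a policy-gradient update whose implicit reward is a sparse indicator scaled by 1/π_θ, which is where the inverse-confidence weighting comes from. A minimal sketch of that identity:

```latex
% SFT minimizes the negative log-likelihood of the expert token y* given x:
%   L_SFT(theta) = -log pi_theta(y* | x)
% Rewriting its gradient as an expectation over the model's own distribution
% exposes a policy-gradient form with a sparse, 1/pi-weighted implicit reward:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\nabla_\theta \log \pi_\theta(y^\ast \mid x)
  = -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \left[ \frac{\mathbf{1}[y = y^\ast]}{\pi_\theta(y \mid x)}
             \, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

The 1/π_θ factor blows up on tokens the model is unsure about, which is exactly the high-variance behavior described above.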

To fix this, the authors propose Dynamic Fine-Tuning (DFT), which rescales each token's SFT loss by the model's predicted probability for that token. This simple one-line change cancels the inverse-probability weighting, stabilizes gradient updates, and turns the implicit reward into a uniform signal.
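
A minimal sketch of what that one-line change could look like in a PyTorch-style loss function (function and variable names here are illustrative, not taken from the paper's code; the probability factor is detached so it acts as a constant per-token weight, matching the "rescaling" description):

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, target_ids, pad_token_id):
    """Token-level cross-entropy rescaled by the model's own predicted
    probability of each target token (illustrative sketch of DFT).

    logits:     (batch, seq_len, vocab) unnormalized scores per position
    target_ids: (batch, seq_len) expert demonstration tokens
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    tgt_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Standard SFT loss would be: -tgt_logp
    # DFT rescales each token's loss by its predicted probability, treated
    # as a constant weight (detached from the computation graph).
    weight = tgt_logp.detach().exp()          # pi_theta(y_t | context)
    per_token_loss = -weight * tgt_logp

    mask = (target_ids != pad_token_id).float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Compared with plain token-level cross-entropy, the only difference is the `weight` factor; dropping it recovers standard SFT.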

DFT was tested on multiple state-of-the-art models, including Qwen2.5-Math and LLaMA variants, and evaluated on challenging mathematical reasoning benchmarks such as OlympiadBench and AMC. Results show DFT outperforming standard SFT by large margins, especially on the hardest benchmarks, where SFT sometimes degrades performance.

For example, on Qwen2.5-Math-1.5B, DFT improved average accuracy by +15.66 points, nearly six times the gain from SFT. On difficult benchmarks like OlympiadBench, DFT boosted accuracy by +11.20 points while SFT caused a drop, demonstrating DFT's superior robustness and generalization.

DFT also converges faster and requires fewer training steps than SFT, indicating more informative gradient updates. It avoids noisy optimization plateaus by stabilizing the reward signal, enabling the model to learn complex reasoning patterns more efficiently.

Compared to Importance-Weighted SFT (iw-SFT), a concurrent method, DFT consistently shows better average accuracy and robustness across models and datasets. Unlike iw-SFT, DFT does not require a separate reference model, making training simpler and more efficient.

In offline RL settings with reward signals, DFT surprisingly outperforms not only offline methods such as DPO and RFT but also online RL algorithms such as PPO and GRPO. This highlights DFT as a practical, resource-efficient alternative to complex RL pipelines.

Analysis of token probability distributions shows that DFT polarizes token confidence: it boosts the probabilities of some tokens while suppressing others, unlike SFT, which uniformly increases confidence. This selective fitting resembles human learning, focusing on key concepts rather than on all tokens equally.
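
A rough sketch of how this kind of analysis could be reproduced with Hugging Face Transformers (the checkpoint names, the demonstration text, and the 0.1 threshold are placeholders, not values from the paper): score the same expert demonstration under the base and fine-tuned models and count how many target tokens gained or lost probability.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def target_token_probs(model, tokenizer, text):
    """Probability the model assigns to each token of `text` given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                   # (1, T, V)
    probs = logits[:, :-1].softmax(-1)               # position t predicts token t+1
    return probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze()

# Placeholder checkpoints: a base model and its DFT-fine-tuned counterpart.
base = AutoModelForCausalLM.from_pretrained("base-model")
tuned = AutoModelForCausalLM.from_pretrained("dft-tuned-model")
tok = AutoTokenizer.from_pretrained("base-model")

demo = "Question: ... Solution: ..."                 # an expert demonstration
p_base = target_token_probs(base, tok, demo)
p_tuned = target_token_probs(tuned, tok, demo)

# "Polarization": some tokens are pushed toward 1 while others are suppressed,
# rather than every token's probability drifting upward as under SFT.
delta = p_tuned - p_base
print(f"boosted tokens:    {(delta > 0.1).float().mean().item():.1%}")
print(f"suppressed tokens: {(delta < -0.1).float().mean().item():.1%}")
```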

Ablation studies confirm DFT's performance gains are not due to hyperparameter tuning alone. It consistently outperforms SFT across learning rates and batch sizes, demonstrating robustness and ease of integration into existing training workflows.

In conclusion, this work bridges theory and practice: it shows that SFT is implicitly a biased policy-gradient method with an ill-posed reward, and introduces DFT to rectify it. DFT is a simple, one-line fix that substantially improves the generalization and efficiency of LLM fine-tuning.

Limitations include evaluation mainly on mathematical reasoning tasks and models up to 7B parameters; future work aims to extend DFT to other domains, larger models, and multimodal tasks to validate its broad applicability.
