TRL - Transformer Reinforcement Learning

A comprehensive library to post-train foundation models

🎉 What's New

OpenEnv Integration: TRL now supports OpenEnv, the open-source framework from Meta for defining, deploying, and interacting with environments in reinforcement learning and agentic workflows.

Overview

TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled up across various hardware setups.

Highlights

  • Trainers: Various fine-tuning methods are easily accessible via trainers like SFTTrainer, GRPOTrainer, DPOTrainer, RewardTrainer, and more.
  • Efficient and scalable:
    • Leverages 🤗 Accelerate to scale from single GPU to multi-node clusters using methods like DDP and DeepSpeed.
    • Full integration with 🤗 PEFT enables training on large models with modest hardware via quantization and LoRA/QLoRA (see the sketch after this list).
    • Integrates 🦥 Unsloth for accelerating training using optimized kernels.
  • Command Line Interface (CLI): A simple interface lets you fine-tune models without needing to write code.
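
As a minimal sketch of the 🤗 PEFT integration noted above, the snippet below passes a LoraConfig to SFTTrainer so that only low-rank adapter weights are trained. The model name and LoRA hyperparameters are illustrative, not prescriptive.

    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train")

    # Illustrative LoRA settings; tune r, alpha, and dropout for your setup.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
        peft_config=peft_config,  # train LoRA adapters instead of full weights
    )
    trainer.train()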

Installation

Python Package

Install the library using pip:
    pip install trl
    

From source

If you want to use the latest features before an official release, you can install TRL from source:
    pip install git+https://github.com/huggingface/trl.git
    

Repository

If you want to use the examples, you can clone the repository with the following command:
    git clone https://github.com/huggingface/trl.git
    

Quick Start

For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
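
For example, any of the scripts below can be saved to a file and launched across multiple GPUs with 🤗 Accelerate; train.py is an assumed file name for illustration.

    accelerate launch train.py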

SFTTrainer
    from trl import SFTTrainer
    from datasets import load_dataset
    
    dataset = load_dataset("trl-lib/Capybara", split="train")
    
    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
    )
    trainer.train()
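
Since SFTTrainer wraps the 🤗 Transformers trainer, standard training arguments are exposed through a config object. A minimal sketch using SFTConfig, with illustrative output path and hyperparameter values:

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train")

    # SFTConfig extends transformers.TrainingArguments, so the usual knobs
    # (batch size, learning rate, logging, saving, ...) are available.
    training_args = SFTConfig(
        output_dir="Qwen2.5-0.5B-SFT",  # illustrative output directory
        per_device_train_batch_size=2,
        learning_rate=2e-5,
    )

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()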
    

GRPOTrainer
    from datasets import load_dataset
    from trl import GRPOTrainer
    from trl.rewards import accuracy_reward
    
    dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
    
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        reward_funcs=accuracy_reward,
        train_dataset=dataset,
    )
    trainer.train()
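
Beyond the built-in rewards, GRPOTrainer accepts plain Python callables as reward functions. A minimal sketch with a hypothetical brevity reward: TRL calls the function with the generated completions and expects one float per completion, and the scoring rule here is purely illustrative.

    from datasets import load_dataset
    from trl import GRPOTrainer

    def brevity_reward(completions, **kwargs):
        # Completions are strings for standard datasets, or lists of
        # {"role": ..., "content": ...} messages for conversational ones.
        texts = [c if isinstance(c, str) else c[-1]["content"] for c in completions]
        return [1.0 if len(t) < 200 else 0.0 for t in texts]

    dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        reward_funcs=brevity_reward,
        train_dataset=dataset,
    )
    trainer.train()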
    

[!NOTE] For reasoning models, use the reasoning_accuracy_reward() function for better results.

DPOTrainer

DPOTrainer implements the popular Direct Preference Optimization (DPO) algorithm that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the DPOTrainer:
    from datasets import load_dataset
    from trl import DPOTrainer
    
    dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
    
    trainer = DPOTrainer(
        model="Qwen3/Qwen-0.6B",
        train_dataset=dataset,
    )
    trainer.train()
    

RewardTrainer
    from trl import RewardTrainer
    from datasets import load_dataset
    
    dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
    
    trainer = RewardTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        train_dataset=dataset,
    )
    trainer.train()
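
RewardTrainer learns from preference pairs. As a rough sketch of what a single example in a dataset like trl-lib/ultrafeedback_binarized looks like (field names follow TRL's preference format; the text itself is made up):

    # One preference example: the model learns to score "chosen" above "rejected".
    example = {
        "chosen": [
            {"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."},
        ],
        "rejected": [
            {"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is green."},
        ],
    }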
    

Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):

SFT:

    trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
        --dataset_name trl-lib/Capybara \
        --output_dir Qwen2.5-0.5B-SFT
    

DPO:

    trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
        --dataset_name argilla/Capybara-Preferences \
        --output_dir Qwen2.5-0.5B-DPO 
    

Read more about the CLI in the relevant documentation section or use --help for more details.

Development

If you want to contribute to trl or customize it to your needs, make sure to read the contribution guide and make a dev install:
    git clone https://github.com/huggingface/trl.git
    cd trl/
    pip install -e .[dev]
    

Experimental

A minimal incubation area is available under trl.experimental for unstable / fast-evolving features. Anything there may change or be removed in any release without notice.

Example:
    from trl.experimental.new_trainer import NewTrainer