TRL - Transformer Reinforcement Learning

A comprehensive library to post-train foundation models

🎉 What's New

OpenEnv Integration: TRL now supports OpenEnv, the open-source framework from Meta for defining, deploying, and interacting with environments in reinforcement learning and agentic workflows.

Overview

TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled up across various hardware setups.

Highlights

  • Trainers: Various fine-tuning methods are easily accessible via trainers like SFTTrainer, GRPOTrainer, DPOTrainer, RewardTrainer, and more.
  • Efficient and scalable:
    • Leverages 🤗 Accelerate to scale from single GPU to multi-node clusters using methods like DDP and DeepSpeed.
    • Full integration with 🤗 PEFT enables training on large models with modest hardware via quantization and LoRA/QLoRA (see the sketch after this list).
    • Integrates 🦥 Unsloth for accelerating training using optimized kernels.
  • Command Line Interface (CLI): A simple interface lets you fine-tune models without needing to write code.
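
As a minimal sketch of the 🤗 PEFT integration noted above, the snippet below passes a LoraConfig to SFTTrainer so that only low-rank adapter weights are trained. The model name and LoRA hyperparameters are illustrative, not prescriptive.

    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train")

    # Illustrative LoRA settings; tune r, alpha, and dropout for your setup.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
        peft_config=peft_config,  # train LoRA adapters instead of full weights
    )
    trainer.train()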

Installation

Python Package

Install the library using pip:
    pip install trl
    

From source

If you want to use the latest features before an official release, you can install TRL from source:
    pip install git+https://github.com/huggingface/trl.git
    

Repository

If you want to use the examples, you can clone the repository with the following command:
    git clone https://github.com/huggingface/trl.git
    

Quick Start

For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
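
For example, any of the scripts below can be saved to a file and launched across multiple GPUs with 🤗 Accelerate; train.py is an assumed file name for illustration.

    accelerate launch train.py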

SFTTrainer
    from trl import SFTTrainer
    from datasets import load_dataset
    
    dataset = load_dataset("trl-lib/Capybara", split="train")
    
    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
    )
    trainer.train()
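
Since SFTTrainer wraps the 🤗 Transformers trainer, standard training arguments are exposed through a config object. A minimal sketch using SFTConfig, with illustrative output path and hyperparameter values:

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train")

    # SFTConfig extends transformers.TrainingArguments, so the usual knobs
    # (batch size, learning rate, logging, saving, ...) are available.
    training_args = SFTConfig(
        output_dir="Qwen2.5-0.5B-SFT",  # illustrative output directory
        per_device_train_batch_size=2,
        learning_rate=2e-5,
    )

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()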
    

GRPOTrainer
    from datasets import load_dataset
    from trl import GRPOTrainer
    from trl.rewards import accuracy_reward
    
    dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
    
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        reward_funcs=accuracy_reward,
        train_dataset=dataset,
    )
    trainer.train()
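
Beyond the built-in rewards, GRPOTrainer accepts plain Python callables as reward functions. A minimal sketch with a hypothetical brevity reward: TRL calls the function with the generated completions and expects one float per completion, and the scoring rule here is purely illustrative.

    from datasets import load_dataset
    from trl import GRPOTrainer

    def brevity_reward(completions, **kwargs):
        # Completions are strings for standard datasets, or lists of
        # {"role": ..., "content": ...} messages for conversational ones.
        texts = [c if isinstance(c, str) else c[-1]["content"] for c in completions]
        return [1.0 if len(t) < 200 else 0.0 for t in texts]

    dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        reward_funcs=brevity_reward,
        train_dataset=dataset,
    )
    trainer.train()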
    

[!NOTE] For reasoning models, use the reasoning_accuracy_reward() function for better results.

DPOTrainer

DPOTrainer implements the popular Direct Preference Optimization (DPO) algorithm that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the DPOTrainer:
    from datasets import load_dataset
    from trl import DPOTrainer
    
    dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
    
    trainer = DPOTrainer(
        model="Qwen3/Qwen-0.6B",
        train_dataset=dataset,
    )
    trainer.train()
    

RewardTrainer
    from trl import RewardTrainer
    from datasets import load_dataset
    
    dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
    
    trainer = RewardTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        train_dataset=dataset,
    )
    trainer.train()
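
RewardTrainer learns from preference pairs. As a rough sketch of what a single example in a dataset like trl-lib/ultrafeedback_binarized looks like (field names follow TRL's preference format; the text itself is made up):

    # One preference example: the model learns to score "chosen" above "rejected".
    example = {
        "chosen": [
            {"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."},
        ],
        "rejected": [
            {"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is green."},
        ],
    }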
    

Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):

SFT:

    trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
        --dataset_name trl-lib/Capybara \
        --output_dir Qwen2.5-0.5B-SFT
    

DPO:

    trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
        --dataset_name argilla/Capybara-Preferences \
        --output_dir Qwen2.5-0.5B-DPO 
    

Read more about the CLI in the relevant documentation section or use --help for more details.

Development

If you want to contribute to trl or customize it to your needs, make sure to read the contribution guide and make a dev install:
    git clone https://github.com/huggingface/trl.git
    cd trl/
    pip install -e .[dev]
    

Experimental

A minimal incubation area is available under trl.experimental for unstable / fast-evolving features. Anything there may change or be removed in any release without notice.

Example:
    from trl.experimental.new_trainer import NewTrainer