Unsloth: 2x Faster Fine-Tuning with 70% Less VRAM
Train gpt-oss, DeepSeek, Gemma, Qwen & Llama 2x faster with 70% less VRAM!

✨ Train for Free
Notebooks are beginner friendly. Read our guide. Add dataset, run, then deploy your trained model.
| Model | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3.5 (4B) | ▶️ Start for free | 1.5x faster | 60% less |
| gpt-oss (20B) | ▶️ Start for free | 2x faster | 70% less |
| gpt-oss (20B): GRPO | ▶️ Start for free | 2x faster | 80% less |
| Qwen3: Advanced GRPO | ▶️ Start for free | 2x faster | 50% less |
| Gemma 3 (4B) Vision | ▶️ Start for free | 1.7x faster | 60% less |
| embeddinggemma (300M) | ▶️ Start for free | 2x faster | 20% less |
| Mistral Ministral 3 (3B) | ▶️ Start for free | 1.5x faster | 60% less |
| Llama 3.1 (8B) Alpaca | ▶️ Start for free | 2x faster | 70% less |
| Llama 3.2 Conversational | ▶️ Start for free | 2x faster | 70% less |
| Orpheus-TTS (3B) | ▶️ Start for free | 1.5x faster | 50% less |
- See all our notebooks for: Kaggle, GRPO, TTS, embedding & Vision
- See all our models and all our notebooks
- See detailed documentation for Unsloth here
⚡ Quickstart
Linux or WSL
pip install unsloth
Windows
For Windows, pip install unsloth works only if you already have PyTorch installed. Read our Windows Guide.
Docker
Use our official Unsloth Docker image, unsloth/unsloth. Read our Docker Guide.
AMD, Intel, Blackwell & DGX Spark
For Blackwell GPUs (RTX 50 series, B200, RTX 6000): pip install unsloth. Read our guides for Blackwell and DGX Spark.
To install Unsloth on AMD and Intel GPUs, follow our AMD Guide and Intel Guide.
🦥 Unsloth News
- Qwen3.5 - 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 112B-A10B are now supported. Guide + notebooks
- Train MoE LLMs 12x faster with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. Blog
- Embedding models: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. Blog • Notebooks
- New 7x longer context RL vs. all other setups, via our new batching algorithms. Blog
- New RoPE & MLP Triton Kernels & Padding Free + Packing: 3x faster training & 30% less VRAM. Blog
- 500K Context: Training a 20B model with >500K context is now possible on an 80GB GPU. Blog
- FP8 & Vision RL: You can now do FP8 & VLM GRPO on consumer GPUs. FP8 Blog • Vision RL
- Docker: Use Unsloth with no setup & environment issues with our new image. Guide • Docker image
- gpt-oss by OpenAI: Read our RL blog, Flex Attention blog and Guide.
More news
- Quantization-Aware Training: We collaborated with PyTorch, recovering ~70% accuracy. Read blog
- Memory-efficient RL: We're introducing even better RL. Our new kernels & algorithms allow faster RL with 50% less VRAM & 10x more context. Read blog
- Mistral 3: Run Ministral 3 or Devstral 2 and fine-tune with vision/RL sudoku notebooks. Guide • Notebooks
- Gemma 3n by Google: Read Blog. We uploaded GGUFs, 4-bit models.
- Text-to-Speech (TTS) is now supported, including `sesame/csm-1b`, and STT via `openai/whisper-large-v3`.
- Qwen3 is now supported. Qwen3-30B-A3B fits on 17.5GB VRAM.
- Introducing Dynamic 2.0 quants that set new benchmarks on 5-shot MMLU & Aider Polyglot.
- EVERYTHING is now supported - all models (TTS, BERT, Mamba), FFT, etc. Multi-GPU is now supported. Enable FFT with `full_finetuning = True` and 8-bit with `load_in_8bit = True`.
- 📣 DeepSeek-R1 - run or fine-tune it with our guide. All model uploads: here.
- 📣 Introducing Long-context Reasoning (GRPO) in Unsloth. Train your own reasoning model with just 5GB VRAM. Transform Llama, Phi, Mistral etc. into reasoning LLMs!
- 📣 Introducing Unsloth Dynamic 4-bit Quantization! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on Hugging Face here.
- 📣 Llama 4 by Meta, including Scout & Maverick are now supported.
- 📣 Phi-4 by Microsoft: We also fixed bugs in Phi-4 and uploaded GGUFs, 4-bit.
- 📣 Vision models now supported! Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B) 2409
- 📣 Llama 3.3 (70B), Meta's latest model is supported.
- 📣 We worked with Apple to add Cut Cross Entropy. Unsloth now supports 89K context for Meta's Llama 3.3 (70B) on an 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support.
- 📣 We found and helped fix a gradient accumulation bug! Please update Unsloth and transformers.
- 📣 We cut memory usage by a further 30% and now support 4x longer context windows!
🔗 Links and Resources
| Type | Links |
|---|---|
| Join Reddit community | Reddit |
| 📚 Documentation & Wiki | Read Our Docs |
| Follow us on X | X (Twitter) |
| 💾 Installation | Pip & Docker Install |
| 🔮 Our Models | Unsloth Catalog |
| ✍️ Blog | Read our Blogs |
⭐ Key Features
- Supports full-finetuning, pretraining, 4-bit, 16-bit and FP8 training
- Supports all models including TTS, multimodal, embedding and more! Any model that works in transformers, works in Unsloth.
- The most efficient library for Reinforcement Learning (RL), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc.
- 0% loss in accuracy - no approximation methods - all exact.
- Export and deploy your model to GGUF (llama.cpp), vLLM, SGLang and Hugging Face.
- Supports NVIDIA (since 2018), AMD and Intel GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc)
- Works on Linux, WSL and Windows
- All kernels written in OpenAI's Triton language. Manual backprop engine.
- If you trained a model with 🦥Unsloth, you can use this cool sticker!

💾 Install Unsloth
You can also see our docs for more detailed installation and updating instructions here.
Unsloth supports Python 3.10 to 3.13.
Pip Installation
Install with pip (recommended) for Linux devices:
pip install unsloth
To update Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
See here for advanced pip install instructions.
Windows Installation
For this method, we will be using Anaconda. You can view the full guide with screenshots here.
1. Install Miniconda (or Anaconda): Miniconda is recommended. Install Miniconda or Anaconda, then open the Anaconda PowerShell Prompt to continue.
2. Create a Conda environment: Create and activate a fresh Python 3.12 environment for Unsloth.
conda create --name unsloth_env python=3.12 -y
conda activate unsloth_env
3. Check your GPU and CUDA version: Run nvidia-smi to confirm that your NVIDIA GPU is detected, and note the CUDA version shown in the output. If nvidia-smi does not work, reinstall the latest NVIDIA drivers.
4. Install PyTorch: Install the Windows pip build of PyTorch that matches your CUDA version. Use the Install PyTorch selector to get the correct command for your system, then verify that PyTorch can see your GPU:
import torch
print(torch.cuda.is_available())
A = torch.ones((10, 10), device="cuda")
B = torch.ones((10, 10), device="cuda")
A @ B
5. Install Unsloth: Only install Unsloth after PyTorch is working correctly.
pip install unsloth
Advanced/Troubleshooting
For advanced installation instructions, or if you see weird errors during installation:
- First try creating an isolated environment, then pip install unsloth inside it:
python -m venv unsloth
source unsloth/bin/activate
pip install unsloth
- Install `torch` and `triton`. Go to https://pytorch.org to install them, for example pip install torch torchvision torchaudio triton
- Confirm that CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or CUDA drivers.
- Install `xformers` manually via:
pip install ninja
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Check if `xformers` succeeded with `python -m xformers.info`. Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs and skip `xformers`.
- For GRPO runs, try installing `vllm` and check that `pip install vllm` succeeds.
- Double check that your versions of Python, CUDA, cuDNN, `torch`, `triton`, and `xformers` are compatible with one another. The PyTorch Compatibility Matrix may be useful.
- Finally, install `bitsandbytes` and check it with `python -m bitsandbytes`
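If errors persist, it helps to dump what your environment actually contains before debugging further. The snippet below is illustrative rather than part of Unsloth, and only uses standard torch introspection:
# Environment sanity check (standard torch introspection only)
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))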
Conda Installation (Optional)
⚠️ Only use Conda if you already have it; otherwise use pip. We support Python 3.10, 3.11, 3.12 and 3.13.
conda create --name unsloth_env python=3.12 -y
conda activate unsloth_env
Use nvidia-smi to find your CUDA version, e.g. 13.0, which maps to cu130:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip3 install unsloth
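To confirm the install worked, a quick smoke test is importing Unsloth's main entry point (the printed message is just illustrative):
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"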
If you're looking to install Conda in a Linux environment, read here, or run the below 🔽
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Advanced Pip Installation
⚠️ Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command differs across torch 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10 and across CUDA versions.
For other torch versions, we support torch211, torch212, torch220, torch230, torch240, torch250, torch260, torch270, torch280, torch290 and torch2100; for CUDA versions, we support cu118, cu121 and cu124. For Ampere devices (A100, H100, RTX 3090) and above, use cu118-ampere, cu121-ampere or cu124-ampere. Note: torch 2.10 only supports CUDA 12.6, 12.8 and 13.0.
For example, if you have torch 2.4 and CUDA 12.1, use:
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have torch 2.9 and CUDA 13.0, use:
pip install --upgrade pip
pip install "unsloth[cu130-torch290] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have torch 2.10 and CUDA 12.6, use:
pip install --upgrade pip
pip install "unsloth[cu126-torch2100] @ git+https://github.com/unslothai/unsloth.git"
And other examples:
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
Or, run the below in a terminal to get the optimal pip installation command:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
Or, run the below manually in a Python REPL:
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
import re
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.5.1'): x = 'cu{}{}-torch250'
elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
elif v < V('2.7.0'): x = 'cu{}{}-torch260'
elif v < V('2.7.9'): x = 'cu{}{}-torch270'
elif v < V('2.8.0'): x = 'cu{}{}-torch271'
elif v < V('2.8.9'): x = 'cu{}{}-torch280'
elif v < V('2.9.1'): x = 'cu{}{}-torch290'
elif v < V('2.9.2'): x = 'cu{}{}-torch291'
elif v < V('2.10.1'): x = 'cu{}{}-torch2100'
else: raise RuntimeError(f"Torch = {v} too new!")
if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v >= V('2.10.0') and cuda not in ("12.6", "12.8", "13.0"): raise RuntimeError(f"Torch 2.10 requires CUDA 12.6, 12.8, or 13.0! Got CUDA = {cuda}")
x = x.format(cuda.replace(".", ""), "-ampere" if False else "") # is_ampere is broken due to flash-attn
print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation')
Docker Installation
You can use our pre-built Docker container, which bundles all dependencies, to run Unsloth instantly with no setup required. Read our guide.
This container requires installing NVIDIA's Container Toolkit.
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth
Access Jupyter Lab at http://localhost:8888 and start fine-tuning!
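The run command above also maps host port 2222 to the container's SSH port 22. If you prefer a plain shell over Jupyter, standard Docker commands work too (the container ID below is whatever Docker assigned):
docker ps # find the container ID or name
docker exec -it <container_id> bash # open a shell inside the container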
📜 Documentation
- Go to our official Documentation for running models, saving to GGUF, checkpointing, evaluation and more!
- Read our Guides for: Fine-tuning, Reinforcement Learning, Text-to-Speech (TTS), Vision and any model.
- We support Hugging Face's transformers, TRL, Trainer, Seq2SeqTrainer and PyTorch code.
Unsloth example code to fine-tune gpt-oss-20b:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4-bit pre-quantized models we support, for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/gpt-oss-20b-unsloth-bnb-4bit", #or choose any model
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/gpt-oss-20b",
max_seq_length = max_seq_length, # Choose any for long context!
load_in_4bit = True, # 4-bit quantization. False = 16-bit LoRA.
load_in_8bit = False, # 8-bit quantization
load_in_16bit = False, # 16-bit LoRA
full_finetuning = False, # Set to True for full fine-tuning.
trust_remote_code = False, # Enable to support new models
# token = "hf_...", # use one if using gated models
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
tokenizer = tokenizer,
args = SFTConfig(
max_seq_length = max_seq_length,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://unsloth.ai/docs for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM or SGLang
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
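As a follow-on, here is a minimal inference-and-save sketch, assuming the model, tokenizer and trainer objects from the script above. The prompt is made up for illustration, and saving uses the standard Transformers/PEFT save_pretrained API:
# Minimal inference sketch after trainer.train() (uses objects defined above)
FastLanguageModel.for_inference(model) # enable Unsloth's fast inference mode
inputs = tokenizer("What is 2+2? Answer:", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
# Save only the LoRA adapter (standard Transformers / PEFT API)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")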
💡 Reinforcement Learning
RL methods including GRPO, GSPO, DrGRPO, DAPO, PPO, Reward Modelling and Online DPO all work with Unsloth, as does FP8 RL training.
Read our Reinforcement Learning Guide or our advanced RL docs for batching, generation & training parameters.
List of RL notebooks:
- gpt-oss GRPO notebook: Link
- FP8 Qwen3-8B GRPO notebook (L4): Link
- Qwen3-VL GSPO notebook: Link
- Advanced Qwen3 GRPO notebook: Link
- ORPO notebook: Link
- DPO Zephyr notebook: Link
- KTO notebook: Link
- SimPO notebook: Link
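For orientation, here is a minimal GRPO outline in the spirit of the notebooks above. It is an illustrative sketch, not the notebooks' exact code: the model name, LoRA settings, toy reward function and hyperparameters are placeholder assumptions, and it relies on TRL's GRPOTrainer/GRPOConfig.
# Illustrative GRPO outline; model choice, reward and hyperparameters are placeholders.
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3-4B", # placeholder model choice
max_seq_length = 1024,
load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
model, r = 16, lora_alpha = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 50 characters long.
    return [-abs(len(c) - 50) / 50.0 for c in completions]
trainer = GRPOTrainer(
model = model,
reward_funcs = [reward_len],
args = GRPOConfig(
max_prompt_length = 256,
max_completion_length = 256,
num_generations = 4, # completions sampled per prompt
per_device_train_batch_size = 4,
max_steps = 50,
output_dir = "grpo_outputs",
),
train_dataset = load_dataset("trl-lib/tldr", split = "train[:1%]"),
)
trainer.train()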
🥇 Performance Benchmarking
- For our most detailed benchmarks, read our Llama 3.3 Blog.
- Benchmarking of Unsloth was also conducted by 🤗Hugging Face.
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
Context length benchmarks
Llama 3.1 (8B) max. context length
We tested Llama 3.1 (8B) Instruct and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context fine-tuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 40 GB | 153,977 | 12,264 |
| 48 GB | 191,728 | 15,502 |
| 80 GB | 342,733 | 28,454 |
Llama 3.3 (70B) max. context length
We tested Llama 3.3 (70B) Instruct on an 80GB A100 and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context fine-tuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 48 GB | 12,106 | OOM |
| 80 GB | 89,389 | 6,916 |