Unsloth: 2x Faster Fine-Tuning with 70% Less VRAM
Train gpt-oss, DeepSeek, Gemma, Qwen & Llama 2x faster with 70% less VRAM!

✨ Train for Free
Notebooks are beginner friendly. Read our guide. Add dataset, run, then deploy your trained model.
| Model | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3.5 (4B) | ▶️ Start for free | 1.5x faster | 60% less |
| gpt-oss (20B) | ▶️ Start for free | 2x faster | 70% less |
| gpt-oss (20B): GRPO | ▶️ Start for free | 2x faster | 80% less |
| Qwen3: Advanced GRPO | ▶️ Start for free | 2x faster | 50% less |
| Gemma 3 (4B) Vision | ▶️ Start for free | 1.7x faster | 60% less |
| embeddinggemma (300M) | ▶️ Start for free | 2x faster | 20% less |
| Mistral Ministral 3 (3B) | ▶️ Start for free | 1.5x faster | 60% less |
| Llama 3.1 (8B) Alpaca | ▶️ Start for free | 2x faster | 70% less |
| Llama 3.2 Conversational | ▶️ Start for free | 2x faster | 70% less |
| Orpheus-TTS (3B) | ▶️ Start for free | 1.5x faster | 50% less |
- See all our notebooks for: Kaggle, GRPO, TTS, embedding & Vision
- See all our models and all our notebooks
- See detailed documentation for Unsloth here
⚡ Quickstart
Linux or WSL
pip install unsloth
Windows
For Windows, pip install unsloth works only if you already have PyTorch installed. Read our Windows Guide.
Docker
Use our official Unsloth Docker image, unsloth/unsloth. Read our Docker Guide.
AMD, Intel, Blackwell & DGX Spark
For Blackwell GPUs (RTX 50 series, B200, RTX 6000): pip install unsloth. Read our guides for Blackwell and DGX Spark.
To install Unsloth on AMD and Intel GPUs, follow our AMD Guide and Intel Guide.
🦥 Unsloth News
- Qwen3.5 - 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 112B-A10B are now supported. Guide + notebooks
- Train MoE LLMs 12x faster with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. Blog
- Embedding models: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. Blog • Notebooks
- New 7x longer context RL vs. all other setups, via our new batching algorithms. Blog
- New RoPE & MLP Triton Kernels & Padding Free + Packing: 3x faster training & 30% less VRAM. Blog
- 500K Context: Training a 20B model with >500K context is now possible on an 80GB GPU. Blog
- FP8 & Vision RL: You can now do FP8 & VLM GRPO on consumer GPUs. FP8 Blog • Vision RL
- Docker: Use Unsloth with no setup & environment issues with our new image. Guide • Docker image
- gpt-oss by OpenAI: Read our RL blog, Flex Attention blog and Guide.
More news
- Quantization-Aware Training: We collaborated with PyTorch, recovering ~70% accuracy. Read blog
- Memory-efficient RL: We're introducing even better RL. Our new kernels & algorithms allow faster RL with 50% less VRAM & 10x more context. Read blog
- Mistral 3: Run Ministral 3 or Devstral 2 and fine-tune with vision/RL sudoku notebooks. Guide • Notebooks
- Gemma 3n by Google: Read Blog. We uploaded GGUFs, 4-bit models.
- Text-to-Speech (TTS) is now supported, including `sesame/csm-1b`, and STT via `openai/whisper-large-v3`.
- Qwen3 is now supported. Qwen3-30B-A3B fits on 17.5GB VRAM.
- Introducing Dynamic 2.0 quants that set new benchmarks on 5-shot MMLU & Aider Polyglot.
- EVERYTHING is now supported - all models (TTS, BERT, Mamba), FFT, etc. Multi-GPU is now supported. Enable FFT with `full_finetuning = True` and 8-bit with `load_in_8bit = True`.
- 📣 DeepSeek-R1 - run or fine-tune it with our guide. All model uploads: here.
- 📣 Introducing Long-context Reasoning (GRPO) in Unsloth. Train your own reasoning model with just 5GB VRAM. Transform Llama, Phi, Mistral etc. into reasoning LLMs!
- 📣 Introducing Unsloth Dynamic 4-bit Quantization! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on Hugging Face here.
- 📣 Llama 4 by Meta, including Scout & Maverick are now supported.
- 📣 Phi-4 by Microsoft: We also fixed bugs in Phi-4 and uploaded GGUFs, 4-bit.
- 📣 Vision models now supported! Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B) 2409
- 📣 Llama 3.3 (70B), Meta's latest model is supported.
- 📣 We worked with Apple to add Cut Cross Entropy. Unsloth now supports 89K context for Meta's Llama 3.3 (70B) on an 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support.
- 📣 We found and helped fix a gradient accumulation bug! Please update Unsloth and transformers.
- 📣 We cut memory usage by a further 30% and now support 4x longer context windows!
🔗 Links and Resources
| Type | Links |
|---|---|
| Join Reddit community | Reddit |
| 📚 Documentation & Wiki | Read Our Docs |
| Follow us on X | X (Twitter) |
| 💾 Installation | Pip & Docker Install |
| 🔮 Our Models | Unsloth Catalog |
| ✍️ Blog | Read our Blogs |
⭐ Key Features
- Supports full-finetuning, pretraining, 4-bit, 16-bit and FP8 training
- Supports all models including TTS, multimodal, embedding and more! Any model that works in transformers, works in Unsloth.
- The most efficient library for Reinforcement Learning (RL), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc.
- 0% loss in accuracy - no approximation methods - all exact.
- Export and deploy your model to GGUF (llama.cpp), vLLM, SGLang and Hugging Face.
- Supports NVIDIA (since 2018), AMD and Intel GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc)
- Works on Linux, WSL and Windows
- All kernels written in OpenAI's Triton language. Manual backprop engine.
- If you trained a model with 🦥Unsloth, you can use this cool sticker!

💾 Install Unsloth
You can also see our docs for more detailed installation and updating instructions here.
Unsloth supports Python 3.10 to 3.13.
Pip Installation
Install with pip (recommended) for Linux devices:
pip install unsloth
To update Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
See here for advanced pip install instructions.
Windows Installation
For this method, we will be using Anaconda. You can view the full guide with screenshots here.
1. Install Miniconda (or Anaconda): Miniconda is recommended. Install Miniconda or Anaconda, then open the Anaconda PowerShell Prompt to continue.
2. Create a Conda environment: Create and activate a fresh Python 3.12 environment for Unsloth.
conda create --name unsloth_env python=3.12 -y
conda activate unsloth_env
3. Check your GPU and CUDA version: Run nvidia-smi to confirm that your NVIDIA GPU is detected, and note the CUDA version shown in the output. If nvidia-smi does not work, reinstall the latest NVIDIA drivers.
4. Install PyTorch: Install the Windows pip build of PyTorch that matches your CUDA version. Use the Install PyTorch selector to get the correct command for your system, then verify that PyTorch can see your GPU:
import torch
print(torch.cuda.is_available())
A = torch.ones((10, 10), device="cuda")
B = torch.ones((10, 10), device="cuda")
A @ B
5. Install Unsloth: Only install Unsloth after PyTorch is working correctly.
pip install unsloth
Advanced/Troubleshooting
For advanced installation instructions, or if you see weird errors during installation:
- First try creating an isolated environment, then pip install unsloth inside it:
python -m venv unsloth
source unsloth/bin/activate
pip install unsloth
- Install `torch` and `triton`. Go to https://pytorch.org to install them, for example pip install torch torchvision torchaudio triton
- Confirm that CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or CUDA drivers.
- Install `xformers` manually via:
pip install ninja
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Check if `xformers` succeeded with `python -m xformers.info`. Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs and skip `xformers`.
- For GRPO runs, try installing `vllm` and check that `pip install vllm` succeeds.
- Double check that your versions of Python, CUDA, cuDNN, `torch`, `triton`, and `xformers` are compatible with one another. The PyTorch Compatibility Matrix may be useful.
- Finally, install `bitsandbytes` and check it with `python -m bitsandbytes`
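If errors persist, it helps to dump what your environment actually contains before debugging further. The snippet below is illustrative rather than part of Unsloth, and only uses standard torch introspection:
# Environment sanity check (standard torch introspection only)
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))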
Conda Installation (Optional)
⚠️ Only use Conda if you already have it; otherwise use pip. We support Python 3.10, 3.11, 3.12 and 3.13.
conda create --name unsloth_env python=3.12 -y
conda activate unsloth_env
Use nvidia-smi to find your CUDA version, e.g. 13.0, which maps to cu130:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip3 install unsloth
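To confirm the install worked, a quick smoke test is importing Unsloth's main entry point (the printed message is just illustrative):
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"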
If you're looking to install Conda in a Linux environment, read here, or run the below 🔽
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Advanced Pip Installation
⚠️ Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command differs across torch 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10 and across CUDA versions.
For other torch versions, we support torch211, torch212, torch220, torch230, torch240, torch250, torch260, torch270, torch280, torch290 and torch2100; for CUDA versions, we support cu118, cu121 and cu124. For Ampere devices (A100, H100, RTX 3090) and above, use cu118-ampere, cu121-ampere or cu124-ampere. Note: torch 2.10 only supports CUDA 12.6, 12.8 and 13.0.
For example, if you have torch 2.4 and CUDA 12.1, use:
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have torch 2.9 and CUDA 13.0, use:
pip install --upgrade pip
pip install "unsloth[cu130-torch290] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have torch 2.10 and CUDA 12.6, use:
pip install --upgrade pip
pip install "unsloth[cu126-torch2100] @ git+https://github.com/unslothai/unsloth.git"
And other examples:
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
Or, run the below in a terminal to get the optimal pip installation command:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
Or, run the below manually in a Python REPL:
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
import re
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.5.1'): x = 'cu{}{}-torch250'
elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
elif v < V('2.7.0'): x = 'cu{}{}-torch260'
elif v < V('2.7.9'): x = 'cu{}{}-torch270'
elif v < V('2.8.0'): x = 'cu{}{}-torch271'
elif v < V('2.8.9'): x = 'cu{}{}-torch280'
elif v < V('2.9.1'): x = 'cu{}{}-torch290'
elif v < V('2.9.2'): x = 'cu{}{}-torch291'
elif v < V('2.10.1'): x = 'cu{}{}-torch2100'
else: raise RuntimeError(f"Torch = {v} too new!")
if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v >= V('2.10.0') and cuda not in ("12.6", "12.8", "13.0"): raise RuntimeError(f"Torch 2.10 requires CUDA 12.6, 12.8, or 13.0! Got CUDA = {cuda}")
x = x.format(cuda.replace(".", ""), "-ampere" if False else "") # is_ampere is broken due to flash-attn
print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation')
Docker Installation
You can use our pre-built Docker container, which bundles all dependencies, to run Unsloth instantly with no setup required. Read our guide.
This container requires installing NVIDIA's Container Toolkit.
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth
Access Jupyter Lab at http://localhost:8888 and start fine-tuning!
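The run command above also maps host port 2222 to the container's SSH port 22. If you prefer a plain shell over Jupyter, standard Docker commands work too (the container ID below is whatever Docker assigned):
docker ps # find the container ID or name
docker exec -it <container_id> bash # open a shell inside the container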
📜 Documentation
- Go to our official Documentation for running models, saving to GGUF, checkpointing, evaluation and more!
- Read our Guides for: Fine-tuning, Reinforcement Learning, Text-to-Speech (TTS), Vision and any model.
- We support Hugging Face's transformers, TRL, Trainer, Seq2SeqTrainer and PyTorch code.
Unsloth example code to fine-tune gpt-oss-20b:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4-bit pre-quantized models we support, for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/gpt-oss-20b-unsloth-bnb-4bit", #or choose any model
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/gpt-oss-20b",
max_seq_length = max_seq_length, # Choose any for long context!
load_in_4bit = True, # 4-bit quantization. False = 16-bit LoRA.
load_in_8bit = False, # 8-bit quantization
load_in_16bit = False, # 16-bit LoRA
full_finetuning = False, # Set to True for full fine-tuning.
trust_remote_code = False, # Enable to support new models
# token = "hf_...", # use one if using gated models
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
tokenizer = tokenizer,
args = SFTConfig(
max_seq_length = max_seq_length,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://unsloth.ai/docs for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM or SGLang
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
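As a follow-on, here is a minimal inference-and-save sketch, assuming the model, tokenizer and trainer objects from the script above. The prompt is made up for illustration, and saving uses the standard Transformers/PEFT save_pretrained API:
# Minimal inference sketch after trainer.train() (uses objects defined above)
FastLanguageModel.for_inference(model) # enable Unsloth's fast inference mode
inputs = tokenizer("What is 2+2? Answer:", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
# Save only the LoRA adapter (standard Transformers / PEFT API)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")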
💡 Reinforcement Learning
RL methods including GRPO, GSPO, DrGRPO, DAPO, PPO, Reward Modelling and Online DPO all work with Unsloth, as does FP8 RL training.
Read our Reinforcement Learning Guide or our advanced RL docs for batching, generation & training parameters.
List of RL notebooks:
- gpt-oss GRPO notebook: Link
- FP8 Qwen3-8B GRPO notebook (L4): Link
- Qwen3-VL GSPO notebook: Link
- Advanced Qwen3 GRPO notebook: Link
- ORPO notebook: Link
- DPO Zephyr notebook: Link
- KTO notebook: Link
- SimPO notebook: Link
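For orientation, here is a minimal GRPO outline in the spirit of the notebooks above. It is an illustrative sketch, not the notebooks' exact code: the model name, LoRA settings, toy reward function and hyperparameters are placeholder assumptions, and it relies on TRL's GRPOTrainer/GRPOConfig.
# Illustrative GRPO outline; model choice, reward and hyperparameters are placeholders.
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3-4B", # placeholder model choice
max_seq_length = 1024,
load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
model, r = 16, lora_alpha = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 50 characters long.
    return [-abs(len(c) - 50) / 50.0 for c in completions]
trainer = GRPOTrainer(
model = model,
reward_funcs = [reward_len],
args = GRPOConfig(
max_prompt_length = 256,
max_completion_length = 256,
num_generations = 4, # completions sampled per prompt
per_device_train_batch_size = 4,
max_steps = 50,
output_dir = "grpo_outputs",
),
train_dataset = load_dataset("trl-lib/tldr", split = "train[:1%]"),
)
trainer.train()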
🥇 Performance Benchmarking
- For our most detailed benchmarks, read our Llama 3.3 Blog.
- Benchmarking of Unsloth was also conducted by 🤗Hugging Face.
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
Context length benchmarks
Llama 3.1 (8B) max. context length
We tested Llama 3.1 (8B) Instruct and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context fine-tuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 40 GB | 153,977 | 12,264 |
| 48 GB | 191,728 | 15,502 |
| 80 GB | 342,733 | 28,454 |
Llama 3.3 (70B) max. context length
We tested Llama 3.3 (70B) Instruct on an 80GB A100 and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context fine-tuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 48 GB | 12,106 | OOM |
| 80 GB | 89,389 | 6,916 |