vLLM: Fast and Easy LLM Serving and Inference

转载 vllm-project

S 精选进阶深度解析 | 约 2 分钟阅读更新于 2026-03-06

本文为开源社区精选内容，由 vllm-project 原创。文中链接将跳转到原始仓库，部分图片可能加载较慢。

AI 导读

Easy, fast, and cheap LLM serving for everyone

简单、快速、廉价的 LLM（Large Language Model，大型语言模型）服务，人人可用

🔥 We have built a vllm website to help you get started with vllm. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.

🔥 我们构建了一个 vllm 网站来帮助您开始使用 vllm。请访问 vllm.ai 了解更多信息。如需活动信息，请访问 vllm.ai/events 加入我们。

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM 是一个快速且易于使用的 LLM (Large Language Model，大型语言模型) 推理和服务库。

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM 最初由加州大学伯克利分校的天空计算实验室 (Sky Computing Lab) 开发，现已发展成为一个由学术界和工业界共同贡献的社区驱动项目。

vLLM is fast with:

vLLM 速度快的原因：

State-of-the-art serving throughput

拥有最先进的服务吞吐量 (State-of-the-art serving throughput)

Efficient management of attention key and value memory with PagedAttention

使用 PagedAttention 有效管理 attention key 和 value 内存

Continuous batching of incoming requests

持续批量处理传入的请求 (Continuous batching of incoming requests)

Fast model execution with CUDA/HIP graph

使用 CUDA/HIP graph 快速执行模型

Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8

量化 (Quantizations)：GPTQ, AWQ, AutoRound, INT4, INT8 和 FP8

Optimized CUDA kernels, including integration with FlashAttention and FlashInfer

优化的 CUDA 内核 (CUDA kernels)，包括与 FlashAttention 和 FlashInfer 的集成

Speculative decoding

推测解码 (Speculative decoding)

Chunked prefill

分块预填充 (Chunked prefill)

vLLM is flexible and easy to use with:

vLLM 灵活且易于使用：

Seamless integration with popular Hugging Face models

与流行的 Hugging Face 模型无缝集成

High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

通过各种解码算法（包括并行采样 (parallel sampling)、集束搜索 (beam search) 等）实现高吞吐量服务

Tensor, pipeline, data and expert parallelism support for distributed inference

支持用于分布式推理的张量 (Tensor)、流水线 (pipeline)、数据 (data) 和专家 (expert) 并行

Streaming outputs

流式输出 (Streaming outputs)

OpenAI-compatible API server

兼容 OpenAI 的 API 服务器 (OpenAI-compatible API server)

Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.

支持 NVIDIA GPU、AMD CPU 和 GPU、Intel CPU 和 GPU、PowerPC CPU、Arm CPU 和 TPU。此外，还支持各种硬件插件，例如 Intel Gaudi、IBM Spyre 和华为 Ascend。

Prefix caching support

前缀缓存支持 (Prefix caching support)

Multi-LoRA support

多 LoRA 支持 (Multi-LoRA support)

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

vLLM 无缝支持 HuggingFace 上最流行的开源模型，包括：

Transformer-like LLMs (e.g., Llama)

类 Transformer 的 LLM (Transformer-like LLMs)（例如，Llama）

Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)

混合专家 LLM (Mixture-of-Expert LLMs)（例如，Mixtral、Deepseek-V2 和 V3）

Embedding Models (e.g., E5-Mistral)

嵌入模型 (Embedding Models)（例如，E5-Mistral）

Multi-modal LLMs (e.g., LLaVA)

多模态 LLM (Multi-modal LLMs)（例如，LLaVA）

Find the full list of supported models here.

在此处查找支持的模型的完整列表。

Contact Us

联系我们

For technical questions and feature requests, please use GitHub Issues

如有技术问题和功能请求，请使用 GitHub Issues

For discussing with fellow users, please use the vLLM Forum

如需与其他用户讨论，请使用 vLLM 论坛 (vLLM Forum)

For coordinating contributions and development, please use Slack

如需协调贡献和开发，请使用 Slack

For security disclosures, please use GitHub's Security Advisories feature

如需进行安全披露，请使用 GitHub 的安全公告功能 (Security Advisories feature)

For collaborations and partnerships, please contact us at [email protected]

如需合作与伙伴关系，请通过 [email protected] 联系我们

Media Kit

媒体资料包 (Media Kit)

If you wish to use vLLM's logo, please refer to our media kit repo

如果您想使用 vLLM 的徽标，请参阅我们的媒体资料包仓库 (media kit repo)

深度加工（NotebookLM 生成）

基于本文内容生成的 PPT 大纲、博客摘要、短视频脚本与 Deep Dive 播客，用于多场景复用

PPT 大纲（5-8 张幻灯片）点击展开

vLLM: Fast and Easy LLM Serving and Inference — ppt

幻灯片 1：vLLM 简介

vLLM 是一个快速且易于使用的大语言模型（LLM）推理和部署库 [1]。
该项目最初由加州大学伯克利分校（UC Berkeley）的 Sky Computing Lab 开发 [1]。
现已发展成为由学术界和工业界共同贡献的社区驱动型开源项目 [1]。

幻灯片 2：核心优势——极致的性能

提供业界领先（State-of-the-art）的模型服务与部署吞吐量 [1]。
引入 PagedAttention 技术，高效管理注意力机制的键值（Key and Value）内存 [1]。
支持对输入请求的连续批处理（Continuous batching）和分块预填充（Chunked prefill） [1]。
内置优化的 CUDA 内核，集成了 FlashAttention 与 FlashInfer，并支持推测性解码（Speculative decoding） [1]。

幻灯片 3：全面的量化支持与执行加速

借助 CUDA/HIP 计算图（graph）实现极速的模型执行 [1]。
广泛支持主流模型量化技术，包括 GPTQ、AWQ 和 AutoRound [1]。
兼容多种精度格式的推理运算，如 INT4、INT8 以及 FP8，满足不同场景的资源限制 [1]。

幻灯片 4：极佳的灵活性与易用性

能够与主流的 Hugging Face 模型库实现无缝集成，开箱即用 [1]。
提供了与 OpenAI 兼容的 API 服务器，并支持低延迟的流式输出（Streaming outputs） [1]。
具备高吞吐量的部署能力，支持并行采样（parallel sampling）、束搜索（beam search）等多种解码算法 [1]。
内置前缀缓存（Prefix caching）技术并全面支持多 LoRA（Multi-LoRA）微调模型部署 [1]。

幻灯片 5：强大的分布式架构与多硬件兼容

为分布式推理提供了张量、流水线、数据以及专家（expert）并行等全面的并行计算支持 [1]。
广泛覆盖各类底层硬件：包括 NVIDIA GPU、AMD CPU/GPU、Intel CPU/GPU、PowerPC CPU、Arm CPU 以及 TPU [1]。
支持多样化的硬件加速插件，例如 Intel Gaudi、IBM Spyre 和华为昇腾（Huawei Ascend）计算卡 [1]。

幻灯片 6：丰富的受支持模型类型

全面支持类 Transformer 的传统大语言模型（如 Llama 系列） [1]。
支持最新的混合专家（MoE）大模型架构（如 Mixtral、Deepseek-V2 和 V3） [1]。
扩展支持非文本类模型，包括嵌入模型（如 E5-Mistral）以及多模态大语言模型（如 LLaVA） [1]。

幻灯片 7：开源社区与技术支持网络

开发者可以通过 GitHub Issues 提交技术问题、功能请求和安全漏洞披露 [1]。
用户可通过专属的 vLLM 论坛和开发者 Slack 频道进行日常讨论和开发协调工作 [1]。
寻求商业合作与伙伴关系的团队可通过官方邮箱（[email protected]）直接联系项目组 [1]。

博客摘要 + 核心看点点击展开

vLLM: Fast and Easy LLM Serving and Inference — summary

SEO 友好博客摘要

想要实现快速、低成本的大语言模型（LLM）部署与推理？vLLM 是您的理想选择！作为一个源自加州大学伯克利分校的开源项目，vLLM 凭借其创新的 PagedAttention 技术，实现了行业领先的高吞吐量和极具效率的注意力键值显存管理[1]。它不仅支持连续批处理和多种量化技术（如 GPTQ、AWQ、FP8），还提供无缝对接 Hugging Face 热门模型的体验以及兼容 OpenAI 的 API 服务[1]。无论您使用的是 NVIDIA、AMD 还是各种 AI 加速硬件，vLLM 都能为您提供灵活、强大的分布式推理解决方案，助力大模型应用的高效落地[1]。

核心看点