Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |

🔥 We have built a website to help you get started with vLLM. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
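
Most of these optimizations apply automatically, but a few, such as quantized execution and chunked prefill, are selected through engine arguments. The snippet below is a minimal sketch, assuming an AWQ-quantized checkpoint from Hugging Face; the model name is only an illustrative example, not a recommendation.

```python
# Minimal sketch: selecting a pre-quantized model and chunked prefill through
# the offline Python API. The checkpoint name is illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # example AWQ-quantized Hugging Face checkpoint
    quantization="awq",                    # run the quantized weights with the AWQ kernels
    enable_chunked_prefill=True,           # split long prompts into chunks during prefill
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```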

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the example after this list)
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
  • Prefix caching support
  • Multi-LoRA support
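
As a rough sketch of the OpenAI-compatible API server mentioned above: start the server with `vllm serve <model>` and query it with any OpenAI client. The model name and the default port 8000 below are assumptions for illustration.

```python
# Sketch: querying a locally running, OpenAI-compatible vLLM server.
# Assumes the server was started with, for example:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# and is listening on the default port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no API key is required by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(response.choices[0].message.content)
```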

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.
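
As a minimal sketch of how a supported model is loaded: pass its Hugging Face model ID to the LLM constructor. The checkpoints and the tensor-parallel degree below are illustrative assumptions, not recommendations.

```python
# Sketch: loading supported models by their Hugging Face model IDs.
from vllm import LLM

# A dense, Transformer-like LLM:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# A Mixture-of-Experts LLM, sharded over two GPUs with tensor parallelism:
# llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```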

Contact Us

  • For technical questions and feature requests, please use GitHub Issues
  • For discussing with fellow users, please use the vLLM Forum
  • For coordinating contributions and development, please use Slack
  • For security disclosures, please use GitHub's Security Advisories feature
  • For collaborations and partnerships, please contact us at [email protected]

Media Kit

  • If you wish to use vLLM's logo, please refer to our media kit repo