LM Evaluation Harness: A Framework for Few-Shot Evaluation
Latest News 📣
- [2025/12] CLI refactored with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the CLI Reference and Configuration Guide.
- [2025/12] Lighter install: the base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `pip install lm_eval[vllm]`, etc.
- [2025/07] Added a `think_end_token` arg to `hf` (token/str), `vllm`, and `sglang` (str) for stripping CoT reasoning traces from models that support it.
- [2025/03] Added support for steering HF models!
- [2025/02] Added SGLang support!
- [2024/09] We are prototyping support for creating and evaluating on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and the `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it, and suggest they check out `lmms-eval`, a wonderful project originally forked from lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
- [2024/07] API model support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. To run Llama 405B, we recommend using vLLM's OpenAI-compliant API to host the model, and using the `local-completions` model type to evaluate it.
- [2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the leaderboard task group.
Announcement
A new v0.4.0 release of lm-evaluation-harness is available!
New updates and features include:
- New Open LLM Leaderboard tasks have been added! You can find them under the leaderboard task group.
- Internal refactoring
- Config-based task creation and configuration
- Easier import and sharing of externally-defined task config YAMLs
- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more
- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
- Logging and usability changes
- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more
Please see our updated documentation pages in docs/ for more details.
Development will be continuing on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the EleutherAI discord!
Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via transformers (including quantization via GPTQModel and AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with vLLM.
- Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library.
- Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
Install
To install the lm-eval package from the GitHub repository, run:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Installing Model Backends
The base installation provides the core evaluation framework. Model backends must be installed separately using optional extras:
For HuggingFace transformers models:
pip install "lm_eval[hf]"
For vLLM inference:
pip install "lm_eval[vllm]"
For API-based models (OpenAI, Anthropic, etc.):
pip install "lm_eval[api]"
Multiple backends can be installed together:
pip install "lm_eval[hf,vllm,api]"
A detailed table of all optional extras is available at the end of this document.
Basic Usage
Documentation
| Guide | Description |
|---|---|
| CLI Reference | Command-line arguments and subcommands |
| Configuration Guide | YAML config file format and examples |
| Python API | Programmatic usage with simple_evaluate() |
| Task Guide | Available tasks and task configuration |
Use lm-eval -h to see available options, or lm-eval run -h for evaluation options.
List available tasks with:
lm-eval ls tasks
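Evaluations can also be driven programmatically via `simple_evaluate()` (see the Python API guide above). A minimal sketch, with the model and task chosen purely as examples:

import lm_eval

# Minimal programmatic run; mirrors the CLI examples below.
results = lm_eval.simple_evaluate(
    model="hf",                                      # same model types as the CLI
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])               # per-task metrics dict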
Hugging Face transformers
[!Important] To use the HuggingFace backend, first install:
pip install "lm_eval[hf]"
To evaluate a model hosted on the HuggingFace Hub (e.g. GPT-J-6B) on hellaswag you can use the following command (this assumes you are using a CUDA-compatible GPU):
lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8
Additional arguments can be provided to the model constructor using the --model_args flag. Most notably, this supports the common practice of using the revisions feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
lm_eval --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size 8
Models that are loaded via both transformers.AutoModelForCausalLM (autoregressive, decoder-only GPT style models) and transformers.AutoModelForSeq2SeqLM (such as encoder-decoder models like T5) in Huggingface are supported.
Batch size selection can be automated by setting the --batch_size flag to auto. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append :N to the above flag to automatically recompute the largest batch size N times. For example, to recompute the batch size 4 times, the command would be:
lm_eval --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size auto:4
[!Note] Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=/path/to/model`
Evaluating GGUF Models
lm-eval supports evaluating models in GGUF format using the Hugging Face (hf) backend. This allows you to use quantized models compatible with transformers, AutoModel, and llama.cpp conversions.
To evaluate a GGUF model, pass the path to the directory containing the model weights, the gguf_file, and optionally a separate tokenizer path using the --model_args flag.
🚨 Important Note:
If no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take hours or even hang indefinitely. Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.
✅ Recommended usage:
lm_eval --model hf \
--model_args pretrained=/path/to/gguf_folder,gguf_file=model-name.gguf,tokenizer=/path/to/tokenizer \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8
[!Tip] Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).
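One way to produce such a directory is to save the base model's tokenizer locally. A quick sketch, assuming the original (non-GGUF) model is available on the Hub; "org/base-model" is a placeholder repo id:

from transformers import AutoTokenizer

# Save the base model's tokenizer, then pass this directory via tokenizer=...
tokenizer = AutoTokenizer.from_pretrained("org/base-model")  # placeholder repo id
tokenizer.save_pretrained("/path/to/tokenizer")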
Multi-GPU Evaluation with Hugging Face accelerate
We support three main ways of using Hugging Face's accelerate 🚀 library for multi-GPU evaluation.
To perform data-parallel evaluation (where each GPU loads a separate full copy of the model), we leverage the accelerate launcher as follows:
accelerate launch -m lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--batch_size 16
(or via accelerate launch --no-python lm_eval).
For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
WARNING: This setup does not work with FSDP model sharding, so in accelerate config FSDP must be disabled, or the NO_SHARD FSDP option must be used.
The second way of using accelerate for multi-GPU evaluation is when your model is too large to fit on a single GPU.
In this setting, run the library outside the accelerate launcher, but pass parallelize=True to --model_args as follows:
lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--model_args parallelize=True \
--batch_size 16
This means that your model's weights will be split across all available GPUs.
For more advanced users or even larger models, we allow for the following arguments when parallelize=True as well:
- `device_map_option`: how to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: the max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: the max amount of CPU memory to use when offloading model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.
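As a sketch (not a recommendation), these options can be combined with `parallelize=True` in the `--model_args` string or its Python equivalent; all values below are placeholders:

import lm_eval

# Sketch: sharding options from the list above; all values are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=EleutherAI/gpt-j-6b,parallelize=True,"
        "device_map_option=auto,max_memory_per_gpu=40GiB,"
        "max_cpu_memory=64GiB,offload_folder=./offload"
    ),
    tasks=["lambada_openai"],
)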
The third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.
accelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \
-m lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--model_args parallelize=True \
--batch_size 16
To learn more about model parallelism and how to use it with the accelerate library, see the accelerate documentation.
Warning: We do not natively support multi-node evaluation using the hf model type. For multi-node setups, we advise either running inference requests against an externally hosted server, or creating a custom integration with your distributed framework, as is done for the GPT-NeoX library.
Steered Hugging Face transformers models
To evaluate a Hugging Face transformers model with steering vectors applied, specify the model type as steered and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained sparsify or sae_lens models (you will need to install the corresponding optional dependency for this method).
Specify pre-defined steering vectors:
import torch
steer_config = {
"layers.3": {
"steering_vector": torch.randn(1, 768),
"bias": torch.randn(1, 768),
"steering_coefficient": 1,
"action": "add"
},
}
torch.save(steer_config, "steer_config.pt")
Specify derived steering vectors:
import pandas as pd
pd.DataFrame({
"loader": ["sparsify"],
"action": ["add"],
"sparse_model": ["EleutherAI/sae-pythia-70m-32k"],
"hookpoint": ["layers.3"],
"feature_index": [30],
"steering_coefficient": [10.0],
}).to_csv("steer_config.csv", index=False)
Run the evaluation harness with steering vectors applied:
lm_eval --model steered \
--model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size 8
NVIDIA nemo models
NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on language models.
To evaluate a nemo model, start by installing NeMo following the documentation. We highly recommend using the NVIDIA PyTorch or NeMo container, especially if you have issues installing Apex or any other dependencies (see the latest released containers). Please also install the lm-evaluation-harness library following the instructions in the Install section.
NeMo models can be obtained through the NVIDIA NGC Catalog or on NVIDIA's Hugging Face page. The NVIDIA NeMo Framework provides conversion scripts to convert the HF checkpoints of popular models like Llama, Falcon, Mixtral, or MPT to the nemo format.
Run a nemo model on one GPU:
lm_eval --model nemo_lm \
--model_args path=<path_to_nemo_model> \
--tasks hellaswag \
--batch_size 32
It is recommended to unpack the nemo model before running evaluation, to avoid unpacking inside the Docker container, which may overflow disk space. To do so, run:
mkdir MY_MODEL
tar -xvf MY_MODEL.nemo -C MY_MODEL
Multi-GPU evaluation with NVIDIA nemo models
By default, only one GPU is used. But we do support either data replication or tensor/pipeline parallelism during evaluation, on one node.
- To enable data replication, set the `model_args` of `devices` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:
torchrun --nproc-per-node=8 --no-python lm_eval \
--model nemo_lm \
--model_args path=<path_to_nemo_model>,devices=8 \
--tasks hellaswag \
--batch_size 32
- To enable tensor and/or pipeline parallelism, set the `model_args` of `tensor_model_parallel_size` and/or `pipeline_model_parallel_size`. In addition, you also have to set `devices` to be equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:
torchrun --nproc-per-node=4 --no-python lm_eval \
--model nemo_lm \
--model_args path=<path_to_nemo_model>,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \
--tasks hellaswag \
--batch_size 32
Note that it is recommended to replace the python command with torchrun --nproc-per-node=<number of devices> --no-python to facilitate loading the model onto the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.
Not supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.
Megatron-LM models
Megatron-LM is NVIDIA's large-scale transformer training framework. This backend allows direct evaluation of Megatron-LM checkpoints without conversion.
Requirements:
- Megatron-LM must be installed or accessible via the `MEGATRON_PATH` environment variable
- PyTorch with CUDA support
Setup:
# Set environment variable pointing to Megatron-LM installation
export MEGATRON_PATH=/path/to/Megatron-LM
Basic usage (single GPU):
lm_eval --model megatron_lm \
--model_args load=/path/to/checkpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=/path/to/tokenizer \
--tasks hellaswag \
--batch_size 1
Supported checkpoint formats:
- Standard Megatron checkpoints (`model_optim_rng.pt`)
- Distributed checkpoints (`.distcp` format, auto-detected)
Parallelism Modes
The Megatron-LM backend supports the following parallelism modes:
| Mode | Configuration | Description |
|---|---|---|
| Single GPU | `devices=1` (default) | Standard single-GPU evaluation |
| Data Parallelism | `devices>1`, `TP=1` | Each GPU holds a full model replica; data is distributed |
| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |
| Expert Parallelism | `EP == devices`, `TP=1` | For MoE models, experts are distributed across GPUs |
[!Note]
- Pipeline Parallelism (PP > 1) is not currently supported.
- Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).
Data Parallelism (4 GPUs, each with full model replica):
torchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \
--model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=4 \
--tasks hellaswag
Tensor Parallelism (TP=2):
torchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \
--model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=2,tensor_model_parallel_size=2 \
--tasks hellaswag
Expert Parallelism for MoE models (EP=4):
torchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \
--model_args load=/path/to/moe_checkpoint,tokenizer_model=/path/to/tokenizer,devices=4,expert_model_parallel_size=4 \
--tasks hellaswag
Using extra_args for additional Megatron options:
lm_eval --model megatron_lm \
--model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,extra_args="--no-rope-fusion --trust-remote-code" \
--tasks hellaswag
[!Note] The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.
Multi-GPU evaluation with OpenVINO models
Pipeline parallelism during evaluation is supported with OpenVINO models.
To enable pipeline parallelism, set `pipeline_parallel=True` in `--model_args`. In addition, set `--device` to `HETERO:<GPU index1>,<GPU index2>`, for example `HETERO:GPU.1,GPU.0`. For example, the command to use pipeline parallelism of 2 is:
lm_eval --model openvino \
--tasks wikitext \
--model_args pretrained=<path_to_ov_model>,pipeline_parallel=True \
--device HETERO:GPU.1,GPU.0
Tensor + Data Parallel and Optimized Inference with vLLM
We also support vLLM for faster inference on supported model types, especially when splitting a model across multiple GPUs. It handles single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both), for example:
lm_eval --model vllm \
--model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
--tasks lambada_openai \
--batch_size auto
To use vllm, do pip install "lm_eval[vllm]". For a full list of supported vLLM configurations, please reference our vLLM integration and the vLLM documentation.
vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation and provide a script for checking the validity of vllm results against HF.
[!Tip] For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!

[!Tip] Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1, which defaults to a maximum length of 32k.
Tensor + Data Parallel and Fast Offline Batching Inference with SGLang
We support SGLang for efficient offline batch inference. Its Fast Backend Runtime delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).
To use SGLang as the evaluation backend, please install it in advance following the SGLang documentation.
[!Tip] Due to the installation method of `Flashinfer`, a fast attention kernel library, we don't include the dependencies of `SGLang` within pyproject.toml. Note that `Flashinfer` also has some requirements on the `torch` version.
SGLang's server arguments are slightly different from other backends, see here for more information. We provide an example of the usage here:
lm_eval --model sglang \
--model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \
--tasks gsm8k_cot \
--batch_size auto
[!Tip] When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
- Use a manual `batch_size`, rather than `auto`.
- Lower KV cache pool memory usage by adjusting `mem_fraction_static` in your model arguments, for example `--model_args pretrained=...,mem_fraction_static=0.7`.
- Increase the tensor parallel size `tp_size` (if using multiple GPUs).
Windows ML
We support Windows ML for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and NPU (Neural Processing Unit) devices.
To learn more about Windows ML, see: https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview
To use Windows ML, install the required dependencies:
pip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml
Evaluate an ONNX Runtime GenAI LLM on NPU/GPU/CPU on Windows:
lm_eval --model winml \
--model_args pretrained=/path/to/onnx/model \
--tasks mmlu \
--batch_size 1
[!Note] The Windows ML backend is ONLY for the ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify the format by checking for a `genai_config.json` file in the model folder.

[!Note] To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model for that vendor and device type. Converted models won't work, or won't work well, on other vendor or device types. To learn more about model conversion, please visit the Microsoft AI Toolkit.
Model APIs and Inference Servers
[!Important] To use API-based models, first install:
pip install "lm_eval[api]"
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
To call a hosted model, use:
export OPENAI_API_KEY=YOUR_KEY_HERE
lm_eval --model openai-completions \
--model_args model=davinci-002 \
--tasks lambada_openai,hellaswag
We also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.
lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16
Note that for externally hosted models, configs such as --device which relate to where to place a local model should not be used and do not function. Just like you can use --model_args to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
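The same local-server evaluation can also be expressed through the Python API; a sketch reusing the arguments from the command above ({yourip} remains a placeholder):

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=facebook/opt-125m,"
        "base_url=http://{yourip}:8000/v1/completions,"  # placeholder host
        "num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16"
    ),
    tasks=["gsm8k"],
)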
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|---|---|---|---|---|
| OpenAI Completions | :heavy_check_mark: | `openai-completions`, `local-completions` | All OpenAI Completions API models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | All ChatCompletions API models | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | Supported Anthropic Engines | `generate_until` (no logprobs) |
| Anthropic Chat | :heavy_check_mark: | `anthropic-chat`, `anthropic-chat-completions` | Supported Anthropic Engines | `generate_until` (no logprobs) |
| Textsynth | :heavy_check_mark: | `textsynth` | All supported engines | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | All `cohere.generate()` engines | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Llama.cpp (via llama-cpp-python) | :heavy_check_mark: | `gguf`, `ggml` | All models supported by llama.cpp | `generate_until`, `loglikelihood` (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | Most HF Causal Language Models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | Mamba architecture Language Models via the `mamba_ssm` package | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum (Causal LMs) | :heavy_check_mark: | `openvino` | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum-intel IPEX (Causal LMs) | :heavy_check_mark: | `ipex` | Any decoder-only AutoModelForCausalLM | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Neuron via AWS Inf2 (Causal LMs) | :heavy_check_mark: | `neuronx` | Any decoder-only AutoModelForCausalLM supported to run on the huggingface-ami image for inferentia2 | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| NVIDIA NeMo | :heavy_check_mark: | `nemo_lm` | All supported models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| NVIDIA Megatron-LM | :heavy_check_mark: | `megatron_lm` | Megatron-LM GPT models (standard and distributed checkpoints) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Watsonx.ai | :heavy_check_mark: | `watsonx_llm` | Supported Watsonx.ai Engines | `generate_until`, `loglikelihood` |
| Windows ML | :heavy_check_mark: | `winml` | ONNX models in GenAI format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
Models which do not supply logits or logprobs can be used with tasks of type generate_until only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: generate_until, loglikelihood, loglikelihood_rolling, and multiple_choice.
For more information on the different task output_types and model request types, see our documentation.
[!Note] For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend first carefully looking at a few sample outputs using `--limit 10` to confirm that answer extraction and scoring on generative tasks is performing as expected. Providing `system="<some system prompt here>"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.
Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include GPT-NeoX, Megatron-DeepSpeed, and mesh-transformer-jax.
To create your own custom integration you can follow instructions from this tutorial.
Additional Features
[!Note] For tasks unsuitable for direct evaluation, either due to risks associated with executing untrusted code or due to complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing --device cuda:0 with --device mps (requires PyTorch version 2.1 or higher). Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on --device cpu and --device mps match.
[!Note] You can inspect what the LM inputs look like by running the following command:
python write_out.py \
    --tasks <task1,task2,...> \
    --num_fewshot 5 \
    --num_examples 10 \
    --output_base_path /path/to/output/folder
This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the --check_integrity flag:
lm_eval --model openai \
--model_args engine=davinci-002 \
--tasks lambada_openai,hellaswag \
--check_integrity
Advanced Usage Tips
For models loaded with the HuggingFace transformers library, any arguments provided via --model_args get passed to the relevant constructor directly. This means that anything you can do with AutoModel can be done with our library. For example, you can pass a local path via pretrained= or use models finetuned with PEFT by taking the call you would run to evaluate the base model and add ,peft=PATH to the model_args argument:
lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \
--tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
--device cuda:0
Models provided as delta weights can be easily loaded using the Hugging Face transformers library. Within --model_args, set the delta argument to specify the delta weights, and use the pretrained argument to designate the relative base model to which they will be applied:
lm_eval --model hf \
--model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \
--tasks hellaswag
GPTQ quantized models can be loaded using GPTQModel (faster) or AutoGPTQ.
GPTQModel: add `,gptqmodel=True` to `model_args`:
lm_eval --model hf \
--model_args pretrained=model-name-or-path,gptqmodel=True \
--tasks hellaswag
AutoGPTQ: add ,autogptq=True to model_args:
lm_eval --model hf \
--model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
--tasks hellaswag
We support wildcards in task names, for example you can run all of the machine-translated lambada tasks via --tasks lambada_openai_mt_*.
Saving & Caching Results
To save evaluation results provide an --output_path. We also support logging model responses with the --log_samples flag for post-hoc analysis.
[!TIP] Use `--use_cache <DIR>` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use `--cache_requests` to save dataset preprocessing steps for faster evaluation resumption.
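A sketch of the same behavior through the Python API, assuming the `use_cache` and `cache_requests` arguments to `simple_evaluate()` (the path and model below are placeholders):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    use_cache="./lm_cache",    # reuse cached model responses when resuming
    cache_requests=True,       # also cache dataset preprocessing steps
)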
To push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the HF_TOKEN environment variable. Then, use the --hf_hub_log_args flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub (see an example dataset on the HF Hub). For instance:
lm_eval --model hf \
--model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
--tasks hellaswag \
--log_samples \
--output_path results \
--hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False
This allows you to easily download the results and samples from the Hub, using:
from datasets import load_dataset
load_dataset("EleutherAI/lm-eval-results-private", "hellaswag", "latest")
For a full list of supported arguments, check out the interface guide in our documentation!
Visualizing Results
You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.
Zeno
You can use Zeno to visualize the results of your eval harness runs.
First, head to hub.zenoml.com to create an account and get an API key on your account page. Add this key as an environment variable:
export ZENO_API_KEY=[your api key]
You'll also need to install the lm_eval[zeno] package extra.
To visualize the results, run the eval harness with the log_samples and output_path flags.
We expect output_path to contain multiple folders that represent individual model names.
You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.
lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8 \
--log_samples \
--output_path output/gpt-j-6B
Then, you can upload the resulting data using the zeno_visualize script:
python scripts/zeno_visualize.py \
--data_path output \
--project_name "Eleuther Project"
This will use all subfolders in data_path as different models and upload all tasks within these model folders to Zeno.
If you run the eval harness on multiple tasks, the project_name will be used as a prefix and one project will be created per task.
You can find an example of this workflow in examples/visualize-zeno.ipynb.
Weights and Biases
With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.
The integration provides functionalities:
- to automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task and CLI specific configs,
- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.
First you'll need to install the lm_eval[wandb] package extra: pip install lm_eval[wandb].
Authenticate your machine with your unique W&B token (visit https://wandb.ai/authorize to get one), then run wandb login in your command-line terminal.
Run the eval harness as usual with the --wandb_args flag. Use this flag to provide arguments for initializing a wandb run (wandb.init) as comma-separated string arguments.
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
In the stdout, you will find the link to the W&B run page as well as link to the generated report. You can find an example of this workflow in examples/visualize-wandb.ipynb, and an example of how to integrate it beyond the CLI.
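For the beyond-the-CLI integration mentioned above, a minimal sketch, assuming the `WandbLogger` helper exported from `lm_eval.loggers` (the canonical flow lives in examples/visualize-wandb.ipynb; the API may differ across versions):

import lm_eval
from lm_eval.loggers import WandbLogger

# Run an evaluation, keeping per-sample outputs so they can be logged as W&B Tables.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,trust_remote_code=True",
    tasks=["hellaswag"],
    log_samples=True,
)

# Kwargs are forwarded to wandb.init, mirroring --wandb_args on the CLI.
wandb_logger = WandbLogger(project="lm-eval-harness-integration", job_type="eval")
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
wandb_logger.log_eval_samples(results["samples"])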
Optional Extras
Extra dependencies can be installed via pip install -e ".[NAME]"
Model Backends
These extras install dependencies required to run specific model backends:
| NAME | Description |
|---|---|
| hf | HuggingFace Transformers (torch, transformers, accelerate, peft) |
| vllm | vLLM fast inference |
| api | API models (OpenAI, Anthropic, local servers) |
| gptq | AutoGPTQ quantized models |
| gptqmodel | GPTQModel quantized models |
| ibm_watsonx_ai | IBM watsonx.ai models |
| ipex | Intel IPEX backend |
| optimum | Intel OpenVINO models |
| neuronx | AWS Inferentia2 instances |
| winml | Windows ML (ONNX Runtime GenAI) - CPU/GPU/NPU |
| sparsify | Sparsify model steering |
| sae_lens | SAELens model steering |
Task Dependencies
These extras install dependencies required for specific evaluation tasks:
| NAME | Description |
|---|---|
| tasks | All task-specific dependencies |
| acpbench | ACP Bench tasks |
| audiolm_qwen | Qwen2 audio models |
| ifeval | IFEval task |
| japanese_leaderboard | Japanese LLM tasks |
| longbench | LongBench tasks |
| math | Math answer checking |
| multilingual | Multilingual tokenizers |
| ruler | RULER tasks |
Development & Utilities
| NAME | Description |
|---|---|
| dev | Linting & contributions |
| hf_transfer | Speed up HF downloads |
| sentencepiece | Sentencepiece tokenizer |
| unitxt | Unitxt tasks |
| wandb | Weights & Biases logging |
| zeno | Zeno result visualization |