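AI Gateway Design: Model Routing and Load Balancing

Original | Lingque (灵阙) Teaching and Research Team

S | Featured | Advanced | Architecture Design | About a 10-minute read | Updated 2026-02-28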


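LLM API gateway architecture patterns, intelligent routing strategies, fallback chain design, cost tracking, and token metering

Introduction

When an organization uses several LLM providers at once (OpenAI, Anthropic, Google, open-source models), hard-coding API calls in business code leads to tight coupling, painful provider switches, and runaway costs. An AI gateway serves as a unified access layer that handles the cross-cutting concerns: model routing, load balancing, cost control, observability, and fault tolerance.

This article walks through how to build a production-grade AI gateway, from architecture design to engineering implementation.

Architecture Design

Overall Topology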

┌─────────────────────────────────────────────────────────────┐
│                     Application Layer                       │
│  ┌─────────┐  ┌──────────┐  ┌──────────┐  ┌─────────────┐  │
│  │ ChatBot │  │ Code Gen │  │ RAG App  │  │ Agent System│  │
│  └────┬────┘  └─────┬────┘  └─────┬────┘  └──────┬──────┘  │
└───────┼─────────────┼─────────────┼──────────────┼──────────┘
        │             │             │              │
        ▼             ▼             ▼              ▼
┌─────────────────────────────────────────────────────────────┐
│                      AI Gateway Layer                       │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │  Router  │  │ Limiter  │  │  Cache   │  │ Telemetry  │  │
│  └────┬─────┘  └──────────┘  └──────────┘  └────────────┘  │
│       │                                                     │
│  ┌────▼─────────────────────────────────────────────────┐   │
│  │               Fallback Chain Manager                  │   │
│  └────┬──────────┬──────────┬──────────┬────────────────┘   │
└───────┼──────────┼──────────┼──────────┼────────────────────┘
        │          │          │          │
   ┌────▼───┐ ┌────▼────┐ ┌───▼────┐ ┌───▼─────────┐
   │OpenAI  │ │Anthropic│ │Google  │ │vLLM         │
   │GPT-4o  │ │Claude   │ │Gemini  │ │(self-hosted)│
   └────────┘ └─────────┘ └────────┘ └─────────────┘

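Core Module Responsibilities

Module	Responsibility	Key metric
Router	Selects a provider by scene, model, cost, and latency	Routing decision latency <1 ms
Rate limiter	Enforces RPM/TPM limits so provider quotas are not exceeded	Token-bucket / sliding-window accuracy
Cache layer	Semantic cache plus exact-match cache	Cache hit rate >30%
Fallback manager	Automatic failover, circuit breaking and recovery	Failover latency <100 ms
Observability	Token metering, latency tracing, quality scoring	Data completeness >99.9%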
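
The cache layer in the table combines a semantic cache with an exact-match cache, and the article shows code for neither. The following is a minimal sketch of the exact-match half only, assuming a request is identified by its model, temperature, and messages. The `CacheableRequest` shape, `cacheKey`, and `ExactMatchCache` are hypothetical names introduced for illustration, not part of the original design.

```typescript
// Hypothetical request shape; the article's own LLMRequest type is not shown.
interface CacheableRequest {
  model: string;
  temperature: number;
  messages: { role: string; content: string }[];
}

// Build a deterministic key: fixed field order, trimmed message content.
function cacheKey(req: CacheableRequest): string {
  return JSON.stringify([
    req.model,
    req.temperature,
    req.messages.map(m => [m.role, m.content.trim()]),
  ]);
}

class ExactMatchCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  // `now` is injectable so expiry can be tested without real waiting.
  get(req: CacheableRequest, now: number = Date.now()): string | undefined {
    const key = cacheKey(req);
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.store.delete(key); // Expired entries are dropped lazily on read
      return undefined;
    }
    return hit.value;
  }

  set(req: CacheableRequest, value: string, now: number = Date.now()): void {
    this.store.set(cacheKey(req), { value, expiresAt: now + this.ttlMs });
  }
}
```

In a multi-pod deployment the `Map` would likely be replaced by a shared store, and it is usually safer to cache only deterministic requests (for example temperature 0); both points are design assumptions rather than requirements stated in the article.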

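Intelligent Routing Strategies

Scene-Based Routing

Different task scenes demand different model capabilities, so the router picks the best-suited model for each scene type: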

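// src/router/scene-router.ts
// Minimal shape implied by the route table; the article uses this type without declaring it
interface ProviderModel {
  provider: string;
  model: string;
}

interface RouteConfig {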
  scene: string;
  tiers: {
    premium: ProviderModel[];
    balanced: ProviderModel[];
    fast: ProviderModel[];
  };
}

const routeTable: RouteConfig[] = [
  {
    scene: "chat",
    tiers: {
      premium:  [{ provider: "anthropic", model: "claude-sonnet-4-20250514" }],
      balanced: [{ provider: "openai",    model: "gpt-4o" }],
      fast:     [{ provider: "google",    model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "code",
    tiers: {
      premium:  [{ provider: "anthropic", model: "claude-sonnet-4-20250514" }],
      balanced: [{ provider: "openai",    model: "gpt-4o" }],
      fast:     [{ provider: "google",    model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "vision",
    tiers: {
      premium:  [{ provider: "google",    model: "gemini-2.5-pro" }],
      balanced: [{ provider: "openai",    model: "gpt-4o" }],
      fast:     [{ provider: "google",    model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "embedding",
    tiers: {
      premium:  [{ provider: "openai",    model: "text-embedding-3-large" }],
      balanced: [{ provider: "openai",    model: "text-embedding-3-small" }],
      fast:     [{ provider: "local",     model: "bge-m3" }],
    },
  },
];

type Tier = keyof RouteConfig["tiers"];

function routeRequest(scene: string, tier: Tier = "balanced"): ProviderModel {
  const config = routeTable.find(r => r.scene === scene);
  if (!config) throw new Error(`Unknown scene: ${scene}`);

  const candidates = config.tiers[tier];
  if (!candidates?.length) throw new Error(`No models for ${scene}/${tier}`);

  // Return the first candidate whose provider circuit is not open
  for (const candidate of candidates) {
    if (getCircuitBreaker(candidate.provider).isAvailable()) {
      return candidate;
    }
  }

  // Every provider in this tier is unavailable: step down one tier.
  // "fast" is the last tier, so stop there instead of recursing forever.
  if (tier === "fast") {
    throw new Error(`No available provider for ${scene}`);
  }
  return routeRequest(scene, tier === "premium" ? "balanced" : "fast");
}
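
Each tier holds an array of candidates, but `routeRequest` always returns the first available one, so traffic is never spread across several providers in the same tier. One possible way to add load balancing is weighted random selection. The sketch below assumes a hypothetical `weight` field that the original `ProviderModel` does not have, and takes an injectable random source so the behaviour can be tested deterministically.

```typescript
interface WeightedCandidate {
  provider: string;
  model: string;
  weight: number; // Hypothetical field: relative share of traffic, must be > 0
}

// Pick one candidate with probability proportional to its weight.
// `rand` must return a number in [0, 1).
function pickWeighted(
  candidates: WeightedCandidate[],
  rand: () => number = Math.random,
): WeightedCandidate {
  const total = candidates.reduce((sum, c) => sum + c.weight, 0);
  let threshold = rand() * total;
  let chosen: WeightedCandidate | undefined;
  for (const c of candidates) {
    chosen = c; // Remember the latest candidate in case rounding exhausts the list
    threshold -= c.weight;
    if (threshold < 0) break;
  }
  if (!chosen) throw new Error("No candidates to pick from");
  return chosen;
}
```

With weights 3 and 1, roughly three out of four requests would go to the first candidate. In a gateway this would plausibly run after the circuit-breaker filter, so that only available providers take part in the draw.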

Cost-Aware Routing

// src/router/cost-router.ts
interface ModelPricing {
  provider: string;
  model: string;
  inputPer1M: number;   // USD per 1M input tokens
  outputPer1M: number;  // USD per 1M output tokens
  cachedPer1M?: number; // USD per 1M cached input tokens
}

const pricing: ModelPricing[] = [
  { provider: "openai",    model: "gpt-4o",              inputPer1M: 2.50, outputPer1M: 10.00 },
  { provider: "openai",    model: "gpt-4o-mini",         inputPer1M: 0.15, outputPer1M: 0.60 },
  { provider: "anthropic", model: "claude-sonnet-4-20250514", inputPer1M: 3.00, outputPer1M: 15.00 },
  { provider: "anthropic", model: "claude-3-5-haiku-20241022", inputPer1M: 0.80, outputPer1M: 4.00 },
  { provider: "google",    model: "gemini-2.0-flash",    inputPer1M: 0.10, outputPer1M: 0.40 },
  { provider: "google",    model: "gemini-2.5-pro",      inputPer1M: 1.25, outputPer1M: 10.00 },
];

function estimateCost(
  model: ModelPricing,
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
): number {
  return (
    (estimatedInputTokens / 1_000_000) * model.inputPer1M +
    (estimatedOutputTokens / 1_000_000) * model.outputPer1M
  );
}

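function routeByCost(
  scene: string,
  maxCostUsd: number,
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
): ProviderModel {
  // getSceneCandidates is assumed to flatten all tiers of the scene's route entry
  const candidates = getSceneCandidates(scene);

  // Price each candidate, drop unpriced or over-budget ones, most expensive first
  const affordable = candidates
    .flatMap(c => {
      const price = pricing.find(
        p => p.provider === c.provider && p.model === c.model,
      );
      // Skip models with no pricing entry rather than crash on a missing lookup
      if (!price) return [];
      const cost = estimateCost(price, estimatedInputTokens, estimatedOutputTokens);
      return [{ ...c, cost }];
    })
    .filter(c => c.cost <= maxCostUsd)
    .sort((a, b) => b.cost - a.cost); // Higher cost = higher quality (heuristic)

  if (!affordable.length) {
    throw new Error(`No model within budget $${maxCostUsd} for ${scene}`);
  }

  const { provider, model } = affordable[0];
  return { provider, model };
}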
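
To make the budget filter concrete, here is a small worked example of the cost formula, using the gpt-4o and gemini-2.0-flash prices from the table above. The request size (2,000 input tokens, 500 output tokens) is an arbitrary assumption chosen for illustration, and `costUsd` simply restates `estimateCost` so the snippet runs on its own.

```typescript
interface Price {
  inputPer1M: number;  // USD per 1M input tokens
  outputPer1M: number; // USD per 1M output tokens
}

// Same formula as estimateCost, restated so this snippet is self-contained
function costUsd(price: Price, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * price.inputPer1M +
    (outputTokens / 1_000_000) * price.outputPer1M
  );
}

const gpt4o: Price = { inputPer1M: 2.50, outputPer1M: 10.00 };
const flash: Price = { inputPer1M: 0.10, outputPer1M: 0.40 };

// gpt-4o:           0.002 * 2.50 + 0.0005 * 10.00 = 0.005  + 0.005
// gemini-2.0-flash: 0.002 * 0.10 + 0.0005 * 0.40  = 0.0002 + 0.0002
const gpt4oCost = costUsd(gpt4o, 2_000, 500);
const flashCost = costUsd(flash, 2_000, 500);

console.log(gpt4oCost.toFixed(4), flashCost.toFixed(4)); // prints "0.0100 0.0004"
```

For this request, a `maxCostUsd` of 0.005 would filter out gpt-4o and leave the cheaper model, while a budget of 0.02 would keep both and the sort would prefer gpt-4o.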

Fallback Chains and Fault Tolerance

Circuit Breaker Implementation

// src/resilience/circuit-breaker.ts
enum CircuitState {
  CLOSED = "closed",       // Normal operation
  OPEN = "open",           // Failing, reject requests
  HALF_OPEN = "half_open", // Testing recovery
}

interface CircuitConfig {
  failureThreshold: number;    // Failures before opening
  recoveryTimeout: number;     // ms before trying half-open
  successThreshold: number;    // Successes in half-open to close
  monitorWindow: number;       // ms window for failure counting
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number[] = [];
  private successes = 0;
  private lastStateChange = Date.now();

  constructor(
    private provider: string,
    private config: CircuitConfig = {
      failureThreshold: 5,
      recoveryTimeout: 30_000,
      successThreshold: 3,
      monitorWindow: 60_000,
    },
  ) {}

  isAvailable(): boolean {
    if (this.state === CircuitState.CLOSED) return true;
    if (this.state === CircuitState.OPEN) {
      // Check if recovery timeout has passed
      if (Date.now() - this.lastStateChange > this.config.recoveryTimeout) {
        this.transition(CircuitState.HALF_OPEN);
        return true;
      }
      return false;
    }
    // HALF_OPEN: this sketch admits every caller as a probe; cap probes in production
    return true;
  }

  recordSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successes++;
      if (this.successes >= this.config.successThreshold) {
        this.transition(CircuitState.CLOSED);
      }
    }
    // Any success clears the window, so the threshold counts failures since the last success
    this.failures = [];
  }

  recordFailure(error: Error): void {
    const now = Date.now();
    this.failures.push(now);

    // Clean old failures outside monitoring window
    this.failures = this.failures.filter(
      t => now - t < this.config.monitorWindow
    );

    if (this.state === CircuitState.HALF_OPEN) {
      this.transition(CircuitState.OPEN);
      return;
    }

    if (this.failures.length >= this.config.failureThreshold) {
      this.transition(CircuitState.OPEN);
    }
  }

  private transition(newState: CircuitState): void {
    console.log(
      `CircuitBreaker [${this.provider}]: ${this.state} -> ${newState}`
    );
    this.state = newState;
    this.lastStateChange = Date.now();
    this.successes = 0;
    if (newState === CircuitState.CLOSED) {
      this.failures = [];
    }
  }
}
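
In the class above, `isAvailable()` returns true for every caller while the breaker is half-open, so a provider that is still recovering can receive a full burst of traffic. One possible refinement is to cap the number of trial requests in flight. The `ProbeGate` helper below is a hypothetical addition, not part of the original class, and is only a sketch of the idea.

```typescript
// Hypothetical helper: limits how many trial requests may be in flight
// while a circuit breaker is half-open.
class ProbeGate {
  private inFlight = 0;

  constructor(private maxProbes: number) {}

  // Returns true and reserves a slot if a probe may be sent now.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxProbes) return false;
    this.inFlight++;
    return true;
  }

  // Call once per successful tryAcquire, whether the probe succeeded
  // or failed, so the slot becomes available again.
  release(): void {
    if (this.inFlight > 0) this.inFlight--;
  }
}
```

A breaker could hold one gate per provider, consult `tryAcquire()` in the half-open branch of `isAvailable()`, and call `release()` from both `recordSuccess()` and `recordFailure()`.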

Fallback Chain Executor

// src/resilience/fallback-chain.ts
interface FallbackResult {
  response: LLMResponse;
  provider: string;
  model: string;
  attempts: AttemptRecord[];
}

interface AttemptRecord {
  provider: string;
  model: string;
  latencyMs: number;
  error?: string;
}

async function executeWithFallback(
  request: LLMRequest,
  chain: ProviderModel[],
): Promise<FallbackResult> {
  const attempts: AttemptRecord[] = [];

  for (const { provider, model } of chain) {
    const breaker = getCircuitBreaker(provider);

    if (!breaker.isAvailable()) {
      attempts.push({
        provider, model, latencyMs: 0,
        error: "Circuit breaker OPEN",
      });
      continue;
    }

    const start = Date.now();
    try {
      const response = await callProvider(provider, model, request, {
        timeout: 30_000,
        retries: 1,
      });

      breaker.recordSuccess();
      attempts.push({ provider, model, latencyMs: Date.now() - start });

      return { response, provider, model, attempts };
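    } catch (error) {
      const latencyMs = Date.now() - start;
      // Caught values are `unknown` under strict TypeScript; normalise to Error first
      const err = error instanceof Error ? error : new Error(String(error));
      breaker.recordFailure(err);
      attempts.push({
        provider, model, latencyMs,
        error: err.message,
      });

      // Log for observability
      console.error(
        `Fallback: ${provider}/${model} failed (${latencyMs}ms): ${err.message}`
      );
    }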
  }

  throw new FallbackExhaustedError(
    `All providers failed for request`,
    attempts,
  );
}
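
`executeWithFallback` takes a ready-made `chain`, but the article does not show where that chain comes from. One plausible source is the scene's tier table: start at the requested tier and append the lower tiers in order. The sketch below assumes that convention; `buildFallbackChain` and `TIER_ORDER` are hypothetical names.

```typescript
interface ProviderModel {
  provider: string;
  model: string;
}

type Tier = "premium" | "balanced" | "fast";
type Tiers = Record<Tier, ProviderModel[]>;

// Assumed degradation order, highest quality first
const TIER_ORDER: Tier[] = ["premium", "balanced", "fast"];

// Flatten the tiers into one ordered chain, starting at the requested tier
// and stepping down; an entry with the same provider and model is kept once.
function buildFallbackChain(tiers: Tiers, startTier: Tier): ProviderModel[] {
  const seen = new Set<string>();
  const chain: ProviderModel[] = [];
  for (const tier of TIER_ORDER.slice(TIER_ORDER.indexOf(startTier))) {
    for (const pm of tiers[tier]) {
      const key = `${pm.provider}/${pm.model}`;
      if (!seen.has(key)) {
        seen.add(key);
        chain.push(pm);
      }
    }
  }
  return chain;
}
```

For the "chat" entry of the route table, a request at the balanced tier would yield the chain gpt-4o followed by gemini-2.0-flash. Whether a request may ever be escalated to a more expensive tier is a policy decision this sketch deliberately leaves out.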

Token Metering and Cost Tracking

Accurate Token Counting

// src/metering/token-counter.ts
import { encode } from "gpt-tokenizer";    // tiktoken-compatible tokenizer for OpenAI encodings

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cachedTokens?: number;
  totalTokens: number;
  estimatedCostUsd: number;
}

function countTokens(
  provider: string,
  model: string,
  messages: Message[],
  response: LLMResponse,
): TokenUsage {
  // Most providers return token counts in response
  const usage = response.usage;

  if (usage) {
    const modelPricing = getPricing(provider, model);
    return {
      inputTokens: usage.input_tokens ?? usage.prompt_tokens,
      outputTokens: usage.output_tokens ?? usage.completion_tokens,
      cachedTokens: usage.cache_read_input_tokens ?? 0,
      totalTokens: (usage.input_tokens ?? usage.prompt_tokens) +
                   (usage.output_tokens ?? usage.completion_tokens),
      estimatedCostUsd: calculateCost(modelPricing, usage),
    };
  }

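  // Fallback: client-side estimation. gpt-tokenizer follows OpenAI encodings,
  // so counts for other providers' models are approximate.
  const inputText = messages.map(m => m.content).join(" ");
  const inputTokens = encode(inputText).length;
  const outputTokens = encode(response.content).length;
  const fallbackPricing = getPricing(provider, model);

  return {
    inputTokens,
    outputTokens,
    totalTokens: inputTokens + outputTokens,
    // Priced from the estimated counts, so treat this figure as approximate
    estimatedCostUsd:
      (inputTokens * fallbackPricing.inputPer1M +
        outputTokens * fallbackPricing.outputPer1M) / 1_000_000,
  };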
}
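
`countTokens` calls a `calculateCost` helper that the article does not show, and `ModelPricing` has an optional `cachedPer1M` field that is never used. The sketch below is one way such a helper might price cached input separately. It assumes usage has already been normalised into disjoint cached and uncached input counts; providers report cached tokens differently, so that mapping is left out. `NormalizedUsage` and `calculateCostUsd` are hypothetical names.

```typescript
interface Pricing {
  inputPer1M: number;
  outputPer1M: number;
  cachedPer1M?: number; // Falls back to the normal input price when absent
}

// Hypothetical normalised usage: cached and uncached input are disjoint counts.
interface NormalizedUsage {
  uncachedInputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

function calculateCostUsd(p: Pricing, u: NormalizedUsage): number {
  const cachedRate = p.cachedPer1M ?? p.inputPer1M;
  return (
    u.uncachedInputTokens * p.inputPer1M +
    u.cachedInputTokens * cachedRate +
    u.outputTokens * p.outputPer1M
  ) / 1_000_000;
}
```

As an illustration with made-up numbers: at 3.00 / 15.00 USD per 1M tokens and a cached rate of 0.30, a request with 1,000 uncached input tokens, 9,000 cached input tokens, and 500 output tokens costs (3,000 + 2,700 + 7,500) / 1,000,000 = 0.0132 USD, against 0.0375 USD if the cached tokens were billed at the full input price.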

Cost Dashboard Data Model

-- Cost tracking schema
CREATE TABLE llm_usage_log (
    id              BIGSERIAL PRIMARY KEY,
    request_id      UUID NOT NULL,
    timestamp       TIMESTAMPTZ DEFAULT NOW(),

    -- Routing info
    scene           TEXT NOT NULL,
    tier            TEXT NOT NULL,
    provider        TEXT NOT NULL,
    model           TEXT NOT NULL,

    -- Token usage
    input_tokens    INTEGER NOT NULL,
    output_tokens   INTEGER NOT NULL,
    cached_tokens   INTEGER DEFAULT 0,
    total_tokens    INTEGER GENERATED ALWAYS AS (input_tokens + output_tokens) STORED,

    -- Cost
    cost_usd        NUMERIC(10, 6) NOT NULL,

    -- Performance
    latency_ms      INTEGER NOT NULL,
    ttft_ms         INTEGER,           -- Time to First Token

    -- Context
    user_id         TEXT,
    team_id         TEXT,
    app_id          TEXT NOT NULL,
    was_fallback    BOOLEAN DEFAULT FALSE,
    fallback_chain  JSONB,             -- Full attempt history
    cache_hit       BOOLEAN DEFAULT FALSE
);

-- Daily cost aggregation view
CREATE MATERIALIZED VIEW daily_cost_by_team AS
SELECT
    DATE_TRUNC('day', timestamp) AS day,
    team_id,
    app_id,
    provider,
    model,
    COUNT(*) AS request_count,
    SUM(input_tokens) AS total_input_tokens,
    SUM(output_tokens) AS total_output_tokens,
    SUM(cost_usd) AS total_cost_usd,
    AVG(latency_ms) AS avg_latency_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency_ms,
    SUM(CASE WHEN was_fallback THEN 1 ELSE 0 END) AS fallback_count,
    SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) AS cache_hit_count
FROM llm_usage_log
GROUP BY 1, 2, 3, 4, 5;

-- Query: month-to-date cost, request volume, and cache hit rate by team
SELECT
    team_id,
    SUM(total_cost_usd) AS monthly_cost,
    SUM(request_count) AS monthly_requests,
    ROUND(SUM(total_cost_usd) / SUM(request_count) * 1000, 2) AS cost_per_1k_requests,
    ROUND(SUM(cache_hit_count)::NUMERIC / SUM(request_count) * 100, 1) AS cache_hit_pct
FROM daily_cost_by_team
WHERE day >= DATE_TRUNC('month', NOW())
GROUP BY team_id
ORDER BY monthly_cost DESC;
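
The `was_fallback` and `fallback_chain` columns are meant to be filled from the attempt history that `executeWithFallback` returns. A minimal sketch of that mapping follows. It assumes that `latency_ms` should hold the total time across all attempts and that any request with more than one recorded attempt counts as a fallback; both are assumptions, and `toFallbackFields` is a hypothetical helper.

```typescript
interface AttemptRecord {
  provider: string;
  model: string;
  latencyMs: number;
  error?: string;
}

// Subset of the llm_usage_log columns that derive from the fallback result
interface UsageLogFallbackFields {
  provider: string;
  model: string;
  latency_ms: number;
  was_fallback: boolean;
  fallback_chain: AttemptRecord[];
}

function toFallbackFields(attempts: AttemptRecord[]): UsageLogFallbackFields {
  // The successful attempt is the one recorded without an error
  const winner = attempts.find(a => a.error === undefined);
  if (!winner) throw new Error("No successful attempt to log");
  return {
    provider: winner.provider,
    model: winner.model,
    latency_ms: attempts.reduce((sum, a) => sum + a.latencyMs, 0),
    was_fallback: attempts.length > 1,
    fallback_chain: attempts,
  };
}
```

Exhausted chains, where every attempt failed, would need their own log path, since the table requires a provider, a model, and token counts.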

Rate Limiting

Token Bucket Limiter

// src/rate-limiter/token-bucket.ts
class TokenBucketLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,      // Bucket capacity
    private refillRate: number,     // Tokens added per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  tryConsume(cost: number = 1): boolean {
    this.refill();

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }

    return false;
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate,
    );
    this.lastRefill = now;
  }

  getWaitTime(cost: number = 1): number {
    this.refill();
    if (this.tokens >= cost) return 0;
    return Math.ceil((cost - this.tokens) / this.refillRate * 1000);
  }
}

// Per-provider rate limits (illustrative values; real limits depend on the account tier and model)
const providerLimits = {
  openai: {
    rpm: new TokenBucketLimiter(500, 500 / 60),          // 500 RPM
    tpm: new TokenBucketLimiter(200_000, 200_000 / 60),  // 200K TPM
  },
  anthropic: {
    rpm: new TokenBucketLimiter(1000, 1000 / 60),
    tpm: new TokenBucketLimiter(400_000, 400_000 / 60),
  },
  google: {
    rpm: new TokenBucketLimiter(1000, 1000 / 60),
    tpm: new TokenBucketLimiter(4_000_000, 4_000_000 / 60),
  },
};
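
Each provider above has two buckets, and a request has to pass both. Calling `tryConsume` on the RPM bucket and then on the TPM bucket would burn a request slot whenever the token check fails. The sketch below shows one way to check both before deducting from either. `Bucket` is a simplified stand-in for `TokenBucketLimiter` with an explicit clock parameter, and `admit` is a hypothetical helper.

```typescript
// Minimal stand-in for TokenBucketLimiter, with an explicit clock for testing
class Bucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private perSecond: number, now: number) {
    this.tokens = capacity;
    this.last = now;
  }

  private refill(now: number): void {
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.perSecond);
    this.last = now;
  }

  canConsume(cost: number, now: number): boolean {
    this.refill(now);
    return this.tokens >= cost;
  }

  consume(cost: number): void {
    this.tokens -= cost;
  }
}

// Admit a request only if both the request budget and the token budget allow it;
// nothing is deducted when either check fails.
function admit(rpm: Bucket, tpm: Bucket, estimatedTokens: number, now: number): boolean {
  if (!rpm.canConsume(1, now) || !tpm.canConsume(estimatedTokens, now)) return false;
  rpm.consume(1);
  tpm.consume(estimatedTokens);
  return true;
}
```

This works as written inside a single process. With several gateway pods sharing limits, the check and the deduction would have to be made atomic in the shared store, which this sketch does not attempt.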

Production Deployment Reference Architecture

                    ┌─────────────┐
                    │ Cloudflare  │
                    │   (CDN)     │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │   Nginx     │
                    │ (TLS term.) │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
        │ Gateway  │ │ Gateway  │ │ Gateway  │
        │ Pod #1   │ │ Pod #2   │ │ Pod #3   │
        └─────┬────┘ └────┬─────┘ └────┬─────┘
              │            │            │
        ┌─────▼────────────▼────────────▼─────┐
        │              Redis                   │
        │  (Rate Limits + Semantic Cache)      │
        └─────┬────────────┬──────────────────┘
              │            │
        ┌─────▼────┐ ┌────▼──────┐
        │PostgreSQL│ │ClickHouse │
        │(Config)  │ │(Analytics)│
        └──────────┘ └───────────┘

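Summary

Unified interface, multiple backends: business code only needs to know the scene and quality tier, not the specific provider.
Fallback is a necessity: any single provider will go down at some point, and a fallback chain keeps the service available.
Circuit breakers prevent cascading failures: they identify a failing provider quickly, so users do not waste time waiting on timeouts.
Cost must be visible to be controllable: log token usage and cost per request, and build cost dashboards by team and app.
Caching is the cheapest inference: a semantic cache removes the provider cost of repeated queries, and a 30% hit rate already yields considerable savings.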
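
The last point can be put into rough numbers. The sketch below assumes that a cache hit costs nothing at the provider and ignores the cost of running the cache itself (embedding lookups, Redis); `spendWithCache` is a hypothetical helper and the baseline figure is made up for illustration.

```typescript
// Expected provider spend when a fraction `hitRate` of requests is served from cache.
// Assumes hits are free at the provider and all requests cost the same on average.
function spendWithCache(baselineUsd: number, hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be within [0, 1]");
  }
  return baselineUsd * (1 - hitRate);
}

// A hypothetical 10,000 USD monthly baseline at a 30% hit rate
const monthlySpend = spendWithCache(10_000, 0.3);
console.log(monthlySpend.toFixed(0)); // prints "7000"
```

Under these assumptions the saving scales linearly with the hit rate, so 30% of requests served from cache means roughly 30% less provider spend.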

Maurice

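Derived Content (Generated by NotebookLM)

A PPT outline, blog summary, short-video script, and Deep Dive podcast generated from this article, for reuse across channels.

PPT Outline (5-8 slides)

AI Gateway Design: Model Routing and Load Balancing (PPT)

This PPT outline was generated from the article "AI Gateway Design: Model Routing and Load Balancing" and contains 6 slides.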

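Why an AI Gateway: Motivation and Core Value

Pain points: mixing multiple models leads to tightly coupled code, difficult switching, and runaway costs [1].
Gateway role: a unified access layer that decouples business applications from the underlying model providers [1].
Minimal integration: business code only needs to know the scene and quality tier, not the underlying model implementation or API differences [2].
Consolidated functions: model routing, load balancing, fault tolerance, and cost observability are handled in one place as cross-cutting concerns [1].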

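Core Architecture of a Production-Grade AI Gateway

Overall topology: from top to bottom, the business application layer, the gateway layer (including fallback management), and the model providers such as OpenAI and Anthropic [1, 3].
Five core modules: the router, rate limiter, cache layer, observability, and fallback chain manager together make up the gateway [1, 3].
Performance targets: routing decisions under 1 ms, failover under 100 ms, and a cache hit rate above 30% [3].
Recommended deployment: behind the gateway cluster sit Redis (rate limits and semantic cache), PostgreSQL (configuration), and ClickHouse (analytics) [2].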

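Intelligent Routing: Driven by Scene and Cost

Scene routing: the router matches a suitable model to the task, such as chat, code generation, vision, or embedding [3, 4].
Tiered quality levels: each scene is split into three quality tiers, Premium, Balanced, and Fast [3, 4].
Cost-aware routing: per-request cost is estimated from each model's input and output token prices [5].
Budget cap and downgrade: the router keeps only candidates within the maximum budget, and steps down to a cheaper tier when the higher tier's models are unavailable [5].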

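High Availability: Fallback Chains and Circuit Breaking

Fault tolerance is a must: any single provider can go down, so a fallback chain is required to keep the LLM service available [2].
Three-state circuit breaker: the classic Closed (normal), Open (rejecting requests), and Half-Open (testing recovery) states protect the system [5].
Preventing cascading failures: the circuit breaker quickly identifies and blocks a failing provider, so requests do not waste time and resources waiting on timeouts [2, 5].
Transparent failover: after a failure, latency and error details are recorded and the request moves on to the next available provider in the fallback chain [6].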

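Token Metering and Multi-Dimensional Cost Tracking

Two-level metering: provider-reported token usage is preferred; when it is missing, a local tokenizer produces a fallback estimate [7, 8].
Per-request observability: time to first token (TTFT), total latency, cache hits, and the full fallback attempt history are recorded [8, 9].
Fine-grained cost attribution: aggregate views track token spend by team and app, with daily and monthly rollups [8, 9].
Closing the cost loop: following the principle that cost must be visible to be controllable, the aggregated dashboard replaces guesswork with data [2, 9].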

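Rate Limiting and Cache Optimization

Token-bucket rate limiting: a token bucket caps call frequency so that provider rate limits are not exceeded [10].
Two-dimensional limits: RPM (requests per minute) and TPM (tokens per minute) are configured and enforced together for each provider [2, 10].
Multi-level cache: semantic and exact-match caches intercept repeated requests and reduce load on the providers [2, 3].
The cheapest inference: a cache hit brings the cost of a repeated query to zero, and a 30% hit rate already yields considerable savings [2].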

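Blog Summary + Key Takeaways

AI Gateway Design: Model Routing and Load Balancing (summary)

SEO-Friendly Blog Summary (about 150 characters)

When an enterprise integrates multiple large language models, hard-coded API calls lead to tightly coupled code and runaway costs [1]. This article explains the architecture of a production-grade AI gateway and offers a unified access approach [1]. It covers intelligent model routing based on task scene and budget, fallback failover and circuit breaking for high availability, and accurate token cost metering with semantic caching [2-4]. The architecture improves observability, lowers API spend, and prevents cascading failures [5].

3 Key Takeaways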