功能特性

自动缓存：被动缓存，自动识别重复的上下文内容，无需更改接口调用方式（相对的，在 anthropic API 中使用的需要显式设置参数的缓存模式，我们称之为「主动缓存」，详见Anthropic 主动缓存）
降低成本：命中缓存的输入 Token 以更低价格计费，大幅节省成本
提升速度：减少重复内容的处理时间，加快模型响应

这种机制特别适用于以下场景：

系统提示词复用：在多轮对话中，系统提示词通常保持不变；
固定的工具清单：在一类任务中使用的工具往往是固定的；
多轮对话历史：在复杂的对话中，历史消息往往包含大量重复信息；

满足以上条件的场景，均可利用缓存机制有效节约 tokens 消耗，加快响应速度。

代码示例

Anthropic SDK 示例
OpenAI SDK 示例

安装 SDK

pip install anthropic

环境变量设置国内用户使用 https://api.minimaxi.com/v1，国际用户使用 https://api.minimax.io/v1

export ANTHROPIC_BASE_URL=https://api.minimaxi.com/anthropic
export ANTHROPIC_API_KEY=${YOUR_API_KEY}

第一次请求 - 建立缓存

import anthropic

client = anthropic.Anthropic()

response1 = client.messages.create(
    model="MiniMax-M2.5",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user", 
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("第一次请求结果:")
for block in response1.content:
    if block.type == "thinking":
        print(f"思考:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"输出:\n{block.text}\n")
print(f"输入 Token: {response1.usage.input_tokens}")
print(f"输出 Token: {response1.usage.output_tokens}")
print(f"命中缓存 Token: {response1.usage.cache_read_input_tokens}")

第二次请求 - 复用缓存

response2 = client.messages.create(
    model="MiniMax-M2.5",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user", 
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("\n第二次请求结果:")
for block in response2.content:
    if block.type == "thinking":
        print(f"思考:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"输出:\n{block.text}\n")
print(f"输入 Token: {response2.usage.input_tokens}")
print(f"输出 Token: {response2.usage.output_tokens}")
print(f"命中缓存 Token: {response2.usage.cache_read_input_tokens}")

响应包含上下文缓存的 Token 使用信息：

{
    "usage": {
        "input_tokens": 108,
        "output_tokens": 91,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 14813
    }
}

安装 SDK

pip install openai

环境变量设置国内用户使用 https://api.minimaxi.com/v1，国际用户使用 https://api.minimax.io/v1

export OPENAI_BASE_URL=https://api.minimaxi.com/v1
export OPENAI_API_KEY=${YOUR_API_KEY}

第一次请求 - 建立缓存

from openai import OpenAI

client = OpenAI()

response1 = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"},
        {"role": "user", "content": "<the entire contents of 'Pride and Prejudice'>"},
    ],
    # 设置 reasoning_split=True 将思考内容分离到 reasoning_details 字段
    extra_body={"reasoning_split": True},
)

print("第一次请求结果:")
print(f"回复: {response1.choices[0].message.content}")
print(f"总 Token: {response1.usage.total_tokens}")
print(f"缓存 Token: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")

第二次请求 - 复用缓存

response2 = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[
        {"role": "system", "content": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"},
        {"role": "user", "content": "<the entire contents of 'Pride and Prejudice'>"},
    ],
    # 设置 reasoning_split=True 将思考内容分离到 reasoning_details 字段
    extra_body={"reasoning_split": True},
)

print("\n第二次请求结果:")
print(f"回复: {response2.choices[0].message.content}")
print(f"总 Token: {response2.usage.total_tokens}")
print(f"缓存 Token: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")

响应包含上下文缓存的 Token 使用信息：

{
    "usage": {
        "prompt_tokens": 1200,
        "completion_tokens": 300,
        "total_tokens": 1500,
        "prompt_tokens_details": {
            "cached_tokens": 800
        }
    }
}

注意事项

缓存适用于包含 512 个及以上的输入 token 数量的 API 调用；
缓存采用前缀匹配的方式，以「工具定义-系统提示词-历史对话内容」为顺序构建，任意模块内容的变更，都可能会影响缓存的效果；

最佳实践

在对话的开头部分放置静态的或重复的内容（包括工具定义、系统提示词，历史对话内容），将动态的用户信息放在对话的最后，以最大程度利用 cache;
通过API返回的 usage tokens 数量，来监测缓存性能，定期分析以优化你的使用策略；

计费说明

prompt 缓存采用差异化的计费策略：

缓存命中 Token：按优惠价格计费
新增的输入 Token: 按标准输入价格计费
输出 Token：按标准输出价格计费

详见价格页

计费示例：

假设标准输入价格为 10 元/1M Token， 标准输出价格为 40 元/1M Token， 命中缓存价格为 1 元/1M Token：

单次请求 token 用量详情：
- 总输入 Token： 50000
- 缓存命中 Token： 45000
- 新增输入内容 Token： 5000
- 输出 Token： 1000

计费计算：
- 新增输入内容费用： 5000 × 10/1000000 = 0.05 元
- 缓存费用： 45000 × 1/1000000 = 0.045 元
- 输出费用： 1000 × 40/1000000 = 0.04 元
- 总费用：0.05+0.045+0.04 = 0.135 元

相比无缓存 （50000 × 10/1000000 + 1000 × 40/1000000 = 0.54 元），节省 75%

Cache 对比

	Prompt 缓存（被动缓存）	Anthropic 主动缓存
使用方式	自动识别重复内容并缓存	在API中显式设置 cache_control
计费方式	命中缓存的token以优惠价格进行计费写入缓存的部分无额外计费	命中缓存的token以优惠价格进行计费首次写入缓存的token需要额外计费
缓存过期	根据系统负载自动调整过期时间	5min过期时间，持续使用会自动续期
支持模型	MiniMax-M2.5 系列 MiniMax-M2.1 系列	MiniMax-M2.5 系列 MiniMax-M2.1 系列 MiniMax-M2 系列

开始使用

使用指南

常见问题

条款与政策

Prompt 缓存

功能特性

代码示例

注意事项

最佳实践

计费说明

Cache 对比

更多阅读

Anthropic 主动缓存

开始使用

使用指南

常见问题

条款与政策

​功能特性

​代码示例

​注意事项

​最佳实践

​计费说明

​Cache 对比

​更多阅读

Anthropic 主动缓存

功能特性

代码示例

注意事项

最佳实践

计费说明

Cache 对比

更多阅读