MiniMax Text01/M1 Model Transformers Deployment Guide

This document helps you deploy and run the MiniMax-M1 model with the Transformers library.

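Applicable Models

This document applies to the following models; only the model name needs to change at deployment time. Note that the model repositories usable with Transformers carry an hf suffix. Compared with the repositories without the hf suffix, only the config.json file differs; the weight files are identical. The deployment steps below use MiniMax-M1-40k-hf as the example.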
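As a small illustration of the naming rule above, the sketch below derives the Transformers-compatible repository name from a base name. The helper `to_hf_repo` is hypothetical, and the literal `-hf` suffix form is inferred from the `MiniMax-M1-40k-hf` example rather than from an official naming specification.

```python
def to_hf_repo(repo_id: str) -> str:
    """Return the repository name carrying the hf suffix.

    Assumption: the suffix is the literal string "-hf", as in
    "MiniMax-M1-40k-hf". Names that already end with it are unchanged.
    """
    return repo_id if repo_id.endswith("-hf") else f"{repo_id}-hf"


print(to_hf_repo("MiniMaxAI/MiniMax-M1-40k"))
```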

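Environment Setup

Python: 3.9+

We recommend using a virtual environment (such as venv, conda, or uv) to avoid dependency conflicts.
Run the following commands to install Transformers, torch, and the related dependencies.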

# Using CUDA 12.8
# Install with pip
pip install transformers torch accelerate --extra-index-url https://download.pytorch.org/whl/cu128
# Or install with uv
uv pip install transformers torch accelerate --torch-backend=auto
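Before loading the model, it can help to confirm that the interpreter and packages match the requirements listed above. The following is a minimal sketch using only the standard library; `check_environment` and `REQUIRED_PACKAGES` are illustrative names, not part of Transformers.

```python
import sys
from importlib import metadata

# Package names taken from the install commands in this guide.
REQUIRED_PACKAGES = ("transformers", "torch", "accelerate")


def check_environment(min_python=(3, 9), packages=REQUIRED_PACKAGES):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required")
    for name in packages:
        try:
            metadata.version(name)  # raises if the distribution is missing
        except metadata.PackageNotFoundError:
            problems.append(f"package not installed: {name}")
    return problems


for problem in check_environment():
    print(problem)
```

An empty result only means the packages are installed; it does not verify that the installed torch build matches your CUDA driver.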

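Running with Python

Make sure the required dependencies are installed and the CUDA driver is configured correctly. The following code shows how to load and run the MiniMax-M1 model with Transformers: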

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "MiniMaxAI/MiniMax-M1-40k-hf"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
    {"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]

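# add_generation_prompt=True makes the template end with the assistant turn marker,
# so the model answers the last user message instead of continuing it.
# return_dict=True yields input_ids plus attention_mask for generate().
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)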

response = tokenizer.batch_decode(generated_ids)[0]

print(response)
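For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded `response` above contains the whole conversation. A minimal sketch of keeping only the new tokens follows; `strip_prompt` is a hypothetical helper, and plain lists stand in for token id tensors so the example runs without a GPU.

```python
def strip_prompt(generated_rows, prompt_length):
    """Drop the first `prompt_length` token ids from every generated row."""
    return [row[prompt_length:] for row in generated_rows]


# Toy example: three prompt tokens followed by three generated tokens.
prompt_ids = [101, 102, 103]
generated = [prompt_ids + [7, 8, 9]]
print(strip_prompt(generated, len(prompt_ids)))  # [[7, 8, 9]]
```

With real tensors the same slice is `generated_ids[:, prompt_length:]`, where `prompt_length` is the length of the input id sequence passed to `generate`. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` additionally drops special tokens from the decoded text.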

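Accelerating Inference with Flash Attention

Flash Attention is an efficient implementation of the attention mechanism that can speed up model inference. Make sure your GPU supports Flash Attention; some older GPUs may not be compatible. First, install the flash_attn package.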

# Install with pip
pip install flash_attn --no-build-isolation
# Or install with uv
uv pip install flash_attn --torch-backend=auto --no-build-isolation

To load and run the MiniMax-M1 model with Flash Attention-2, add the following parameters to the from_pretrained call:

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # added parameter
    attn_implementation="flash_attention_2",  # added parameter
)
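Because older GPUs may not be compatible, one option is to decide at runtime whether to request Flash Attention-2. The sketch below is illustrative only: `choose_attn_implementation` and `detect_attn_implementation` are hypothetical helpers, and the threshold of compute capability 8.0 (Ampere and newer) is an assumption based on the flash_attn project's stated GPU support, so check its current requirements for your hardware.

```python
import importlib.util


def choose_attn_implementation(compute_capability, flash_attn_installed):
    """Return "flash_attention_2" when it looks usable, otherwise None.

    Assumption: Flash Attention-2 needs an NVIDIA GPU with compute
    capability 8.0 or newer.
    """
    major, _minor = compute_capability
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    return None


def detect_attn_implementation():
    """Query the local machine; needs torch and a visible CUDA device."""
    import torch  # imported lazily so the helper above has no dependencies

    if not torch.cuda.is_available():
        return None
    installed = importlib.util.find_spec("flash_attn") is not None
    capability = torch.cuda.get_device_capability(0)  # (major, minor)
    return choose_attn_implementation(capability, installed)


print(choose_attn_implementation((8, 0), True))  # flash_attention_2
print(choose_attn_implementation((7, 5), True))  # None
```

When the result is `None`, omit the `attn_implementation` argument so that from_pretrained falls back to the model's default attention.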

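Getting Support

If you run into any problems while deploying MiniMax models, you can reach us in either of the following ways:

Email our technical support team at api@minimaxi.com
Open an issue in our GitHub repository

We will continue to improve the deployment experience on Transformers, and we welcome your feedback!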
