Upload folder using ModelScope SDK

This commit is contained in:
Cherrytest
2025-06-21 18:08:30 +00:00
parent 7e2c993c84
commit e41c819c17
28 changed files with 9914 additions and 40 deletions


@ -0,0 +1,165 @@
---
base_model:
- moonshotai/Kimi-VL-A3B-Instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---
> [!Note]
> This is an improved version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking). Please consider using this updated model instead of the previous version.
<div align="center">
<img width="30%" src="figures/logo.png">
</div>
<div align="center">
<a href="https://arxiv.org/abs/2504.07491">
<b>📄 Tech Report</b>
</a> &nbsp;|&nbsp;
<a href="https://github.com/MoonshotAI/Kimi-VL">
<b>📄 Github</b>
</a> &nbsp;|&nbsp;
<a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking-2506/">💬 Chat Web</a>
</div>
## 1. Introduction
Two months after the initial release of our first open-source multimodal reasoning model, [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking), we are releasing an improved version, [Kimi-VL-A3B-Thinking-2506](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506). Compared to the previous version, the 2506 version offers several new or improved abilities:
- **It Thinks Smarter while Consuming Fewer Tokens**: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.2 on MMMU-Pro (+3.2), 64.0 on MMMU (+2.1), while on average reducing thinking length by 20\%.
- **It Sees Clearer with Thinking**: Unlike the previous version, which specializes in thinking tasks, the 2506 version also achieves the same or better ability on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4), surpassing or matching the abilities of our non-thinking model ([Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Video Scenarios**: The 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good ability on general video understanding (71.9 on Video-MME, matching [Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Higher Resolution**: The 2506 version supports 3.2 million total pixels in a single image, 4x that of the previous version. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
In this blog, we provide simple examples of using the new model in image reasoning, video reasoning, and OS-agent scenarios.
## 2. Performance
## 3. Usage
### 3.1. Inference with VLLM (recommended)
Because this is a long-decode model that can generate up to 32K tokens, we recommend using [vLLM](https://github.com/vllm-project/vllm/tree/main/vllm) for inference; vLLM already supports the Kimi-VL series.
```shell
MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation
```
> [!Note]
> It is important to explicitly install flash-attn to avoid CUDA out-of-memory errors.
```python
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256},
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the raw output into (thinking, summary)."""
    # Thinking was opened but never closed (e.g. generation was cut off): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before `eot` as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
```
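The `◁think▷ ... ◁/think▷` convention used in the example above can be exercised without loading the model. The following is a minimal sketch, assuming the raw completion is a plain string; `split_thinking` is a hypothetical helper written for illustration (it is not part of the Kimi-VL code), and the sample strings are made up.

```python
BOT = "◁think▷"   # marker that opens the reasoning trace
EOT = "◁/think▷"  # marker that closes it


def split_thinking(text: str, bot: str = BOT, eot: str = EOT) -> tuple[str, str]:
    """Return (thinking, summary) for a raw completion string."""
    if eot not in text:
        # Either the trace was cut off before closing, or there was no trace at all.
        return ("", "") if bot in text else ("", text.strip())
    head, _, tail = text.partition(eot)
    # Drop the opening marker if present; everything before `eot` is the trace.
    thinking = head.split(bot, 1)[-1]
    return thinking.strip(), tail.strip()


complete = "◁think▷Short fur, blue eyes, dark points.◁/think▷Siamese"
truncated = "◁think▷Short fur, blue eyes"
plain = "Siamese"

for sample in (complete, truncated, plain):
    print(split_thinking(sample))
```

A truncated trace yields two empty strings here, which is one way to signal that `max_tokens` was likely too small for the answer to be reached.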
### 3.2. Inference with 🤗 Hugging Face Transformers
This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the raw output into (thinking, summary)."""
    # Thinking was opened but never closed (e.g. generation was cut off): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before `eot` as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Download the demo image; `Image.open` cannot read a URL directly.
image_paths = [url]
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# `temperature` only takes effect when sampling is enabled.
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
## 4. Citation
```
@misc{kimiteam2025kimivltechnicalreport,
title={{Kimi-VL} Technical Report},
author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
year={2025},
eprint={2504.07491},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.07491},
}
```

253
README.md

@ -1,47 +1,220 @@
---
base_model:
- moonshotai/Kimi-VL-A3B-Instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---
> [!Note]
> This is an improved version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking). Please consider using this updated model instead of the previous version.
<div align="center">
<img width="80%" src="figures/logo.png">
</div>
<div align="center">
<a href="https://arxiv.org/abs/2504.07491">
<b>📄 Tech Report</b>
</a> &nbsp;|&nbsp;
<a href="https://github.com/MoonshotAI/Kimi-VL">
<b>📄 Github</b>
</a> &nbsp;|&nbsp;
<a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking-2506/">💬 Chat Web</a>
</div>
## 1. Introduction
This is an updated version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking), with the following improved abilities:
- **It Thinks Smarter while Consuming Fewer Tokens**: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), 64.0 on MMMU (+2.1), while on average reducing thinking length by 20\%.
- **It Sees Clearer with Thinking**: Unlike the previous version, which specializes in thinking tasks, the 2506 version also achieves the same or better ability on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4), surpassing or matching the abilities of our non-thinking model ([Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Video Scenarios**: The 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good ability on general video understanding (71.9 on Video-MME, matching [Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Higher Resolution**: The 2506 version supports 3.2 million total pixels in a single image, 4x that of the previous version. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
## 2. Performance
Comparison with efficient models and two previous versions of Kimi-VL:
<div align="center">
| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
|----------------------------|--------|---------------|---------------|----------------------|----------------------|--------------------------|
| **General Multimodal** | | | | | | |
| MMBench-EN-v1.1 (Acc) | 83.1 | 83.2 | 74.6 | 82.9 | 76.0 | **84.4** |
| RealWorldQA (Acc) | 75.4 | 68.5 | 59.1 | 68.1 | 64.0 | **70.0** |
| OCRBench (Acc) | 815 | 864 | 702 | 864 | 864 | **869** |
| MMStar (Acc) | 64.7 | 63.0 | 56.1 | 61.7 | 64.2 | **70.4** |
| MMVet (Acc) | 69.1 | 67.1 | 64.9 | 66.7 | 69.5 | **78.1** |
| **Reasoning** | | | | | | |
| MMMU (val, Pass@1) | 69.1 | 58.6 | 59.6 | 57.0 | 61.7 | **64.0** |
| MMMU-Pro (Pass@1) | 51.7 | 38.1 | 32.1 | 36.0 | 43.2 | **46.3** |
| **Math** | | | | | | |
| MATH-Vision (Pass@1) | 30.4 | 25.0 | 32.1 | 21.7 | 36.8 | **56.9** |
| MathVista_MINI (Pass@1) | 63.8 | 68.0 | 56.1 | 68.6 | 71.7 | **80.1** |
| **Video** | | | | | | |
| VideoMMMU (Pass@1) | 61.2 | 47.4 | 57.0 | 52.1 | 55.5 | **65.2** |
| MMVU (Pass@1) | 67.4 | 50.1 | 57.0 | 52.7 | 53.0 | **57.5** |
| Video-MME (w/ sub.) | 77.2 | 71.6 | 62.1 | **72.7** | 66.0 | 71.9 |
| **Agent Grounding** | | | | | | |
| ScreenSpot-Pro (Acc) | 0.8 | 29.0 | — | 35.4 | — | **52.8** |
| ScreenSpot-V2 (Acc) | 18.1 | 84.2 | — | **92.8** | — | 91.4 |
| OSWorld-G (Acc) | - | 31.5 | — | 41.6 | — | **52.5** |
| **Long Document** | | | | | | |
| MMLongBench-DOC (Acc) | 42.8 | 29.6 | 21.3 | 35.1 | 32.5 | **42.1** |
</div>
Comparison with 30B-70B open-source models:
<div align="center">
| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Qwen2.5-VL-72B | Gemma3-27B-IT |
|----------------------------|---------------------------|---------------|---------------|---------------|
| **General Multimodal** | | | | |
| MMBench-EN-v1.1 (Acc) | 84.4 | - | 88.3 | 78.9 |
| RealWorldQA (Acc) | 70.0 | - | 75.7 | 62.5 |
| OCRBench (Acc) | 869 | - | 885 | 753 |
| MMStar (Acc) | 70.4 | 69.5 | 70.8 | 63.1 |
| MMVet (Acc) | 78.1 | - | 74.0 | 71.0 |
| **Reasoning** | | | | |
| MMMU (val, Pass@1) | 64.0 | 70.0 | 70.2 | 64.9 |
| MMMU-Pro (Pass@1) | 46.3 | 49.5 | 51.1 | - |
| MATH-Vision (Pass@1) | 56.9 | 38.4 | 38.1 | 35.4 |
| MathVista\_MINI (Pass@1) | 80.1 | 74.7 | 74.8 | 59.8 |
| **Video** | | | | |
| VideoMMMU (Pass@1) | 65.2 | - | 60.2 | 61.8 |
| MMVU (Pass@1) | 57.5 | - | 62.9 | 61.3 |
| Video-MME (w/ sub.) | 71.9 | 70.5/77.9 | 73.3/79.1 | - |
| **Agent Grounding** | | | | |
| ScreenSpot-Pro (Acc) | 52.8 | 39.4 | 43.6 | - |
| ScreenSpot-V2 (Acc) | 91.4 | - | - | - |
| OSWorld-G (Acc) | 52.5 | 46.5 | - | - |
| **Long Document** | | | | |
| MMLongBench-DOC (Acc) | 42.1 | - | 38.8 | - |
</div>
## 3. Usage
### 3.1. Inference with VLLM (recommended)
Because this is a long-decode model that can generate up to 32K tokens, we recommend using [vLLM](https://github.com/vllm-project/vllm/tree/main/vllm) for inference; vLLM already supports the Kimi-VL series.
```shell
MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation
```
> [!Note]
> It is important to explicitly install flash-attn to avoid CUDA out-of-memory errors.
```python
import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256},
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the raw output into (thinking, summary)."""
    # Thinking was opened but never closed (e.g. generation was cut off): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before `eot` as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
```
### 3.2. Inference with 🤗 Hugging Face Transformers
This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the raw output into (thinking, summary)."""
    # Thinking was opened but never closed (e.g. generation was cut off): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # If the opening marker is missing, treat everything before `eot` as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Download the demo image; `Image.open` cannot read a URL directly.
image_paths = [url]
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# `temperature` only takes effect when sampling is enabled.
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
## 4. Citation
```
@misc{kimiteam2025kimivltechnicalreport,
title={{Kimi-VL} Technical Report},
author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
year={2025},
eprint={2504.07491},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.07491},
}
```

31
chat_template.jinja Normal file

@ -0,0 +1,31 @@
{%- for message in messages -%}
{%- if loop.first and messages[0]['role'] != 'system' -%}
{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}
{%- endif -%}
{%- if message['role'] == 'system' -%}
{{'<|im_system|>'}}
{%- endif -%}
{%- if message['role'] == 'user' -%}
{{'<|im_user|>'}}
{%- endif -%}
{%- if message['role'] == 'assistant' -%}
{{'<|im_assistant|>'}}
{%- endif -%}
{{- message['role'] -}}
{{'<|im_middle|>'}}
{%- if message['content'] is string -%}
{{- message['content'] + '<|im_end|>' -}}
{%- else -%}
{%- for content in message['content'] -%}
{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
{{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}
{%- else -%}
{{content['text']}}
{%- endif -%}
{%- endfor -%}
{{'<|im_end|>'}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<|im_assistant|>assistant<|im_middle|>'}}
{%- endif -%}
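To make the prompt layout produced by this template easier to see, here is a plain-Python sketch that mirrors its branches one by one. `render_kimi_vl_prompt` is a hypothetical name used only for illustration; actual usage should go through `processor.apply_chat_template`, and this sketch assumes every whitespace between tags is stripped, as the `{%- ... -%}` markers indicate.

```python
ROLE_TOKENS = {
    "system": "<|im_system|>",
    "user": "<|im_user|>",
    "assistant": "<|im_assistant|>",
}
IMAGE_PLACEHOLDER = "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"
DEFAULT_SYSTEM = "<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>"


def render_kimi_vl_prompt(messages: list[dict], add_generation_prompt: bool = False) -> str:
    parts = []
    for index, message in enumerate(messages):
        # A default system turn is injected when the conversation does not start with one.
        if index == 0 and message["role"] != "system":
            parts.append(DEFAULT_SYSTEM)
        parts.append(ROLE_TOKENS.get(message["role"], ""))
        parts.append(message["role"])
        parts.append("<|im_middle|>")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content + "<|im_end|>")
        else:
            for item in content:
                # Any image-like entry becomes a single media placeholder.
                if item.get("type") == "image" or "image" in item or "image_url" in item:
                    parts.append(IMAGE_PLACEHOLDER)
                else:
                    parts.append(item["text"])
            parts.append("<|im_end|>")
    if add_generation_prompt:
        parts.append("<|im_assistant|>assistant<|im_middle|>")
    return "".join(parts)


demo = [{"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "Hi"}]}]
print(render_kimi_vl_prompt(demo, add_generation_prompt=True))
```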

75
config.json Normal file

@ -0,0 +1,75 @@
{
"architectures": [
"KimiVLForConditionalGeneration"
],
"auto_map": {
"AutoConfig": "configuration_kimi_vl.KimiVLConfig",
"AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
"AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
},
"vision_config": {
"model_type": "moonvit",
"patch_size": 14,
"num_attention_heads": 16,
"num_hidden_layers": 27,
"hidden_size": 1152,
"intermediate_size": 4304,
"init_pos_emb_height": 64,
"init_pos_emb_width": 64,
"merge_kernel_size": [
2,
2
],
"torch_dtype": "bfloat16"
},
"text_config": {
"vocab_size": 163840,
"max_position_embeddings": 131072,
"hidden_size": 2048,
"intermediate_size": 11264,
"moe_intermediate_size": 1408,
"num_hidden_layers": 27,
"num_attention_heads": 16,
"n_shared_experts": 2,
"n_routed_experts": 64,
"ep_size": 1,
"routed_scaling_factor": 2.446,
"kv_lora_rank": 512,
"q_lora_rank": null,
"qk_rope_head_dim": 64,
"v_head_dim": 128,
"qk_nope_head_dim": 128,
"topk_method": "noaux_tc",
"n_group": 1,
"topk_group": 1,
"num_experts_per_tok": 6,
"moe_layer_freq": 1,
"first_k_dense_replace": 1,
"norm_topk_prob": true,
"scoring_func": "sigmoid",
"aux_loss_alpha": 0.001,
"seq_aux": true,
"num_key_value_heads": 16,
"hidden_act": "silu",
"initializer_range": 0.02,
"rms_norm_eps": 1e-05,
"pretraining_tp": 1,
"use_cache": true,
"rope_theta": 800000.0,
"rope_scaling": null,
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 163584,
"pad_token_id": 163839,
"eos_token_id": 163585,
"torch_dtype": "bfloat16",
"tie_word_embeddings": false
},
"ignore_index": -100,
"media_placeholder_token_id": 163605,
"torch_dtype": "bfloat16",
"transformers_version": "4.50.3",
"tie_word_embeddings": false,
"vocab_size": 163840,
"model_type": "kimi_vl"
}
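A few quantities implied by `text_config` can be derived with simple arithmetic. The sketch below copies a subset of the values above and assumes DeepSeek-V3-style conventions: a layer is treated as MoE once past the first `first_k_dense_replace` layers and on the `moe_layer_freq` grid, and shared experts are counted as active for every token. The `summarize` helper is hypothetical and is not part of the repository.

```python
import json

# Subset of the text_config values listed above, copied verbatim.
TEXT_CONFIG = json.loads("""
{
  "num_hidden_layers": 27,
  "n_shared_experts": 2,
  "n_routed_experts": 64,
  "num_experts_per_tok": 6,
  "moe_layer_freq": 1,
  "first_k_dense_replace": 1,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 128
}
""")


def summarize(cfg: dict) -> dict:
    # Assumption: layers before first_k_dense_replace are dense, the rest follow moe_layer_freq.
    moe_layers = sum(
        1
        for i in range(cfg["num_hidden_layers"])
        if i >= cfg["first_k_dense_replace"] and i % cfg["moe_layer_freq"] == 0
    )
    return {
        "dense_layers": cfg["num_hidden_layers"] - moe_layers,
        "moe_layers": moe_layers,
        # Routed experts selected per token plus the shared experts.
        "experts_per_token": cfg["num_experts_per_tok"] + cfg["n_shared_experts"],
        # Query/key head width is the non-rotary part plus the rotary part.
        "qk_head_dim": cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"],
    }


print(summarize(TEXT_CONFIG))
```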

1
configuration.json Normal file

@ -0,0 +1 @@
{"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}

284
configuration_kimi_vl.py Normal file

@ -0,0 +1,284 @@
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
from typing import Optional, Union
logger = logging.get_logger(__name__)
DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
class DeepseekV3Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`DeepseekV3Model`]. It is used to instantiate a DeepSeek
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the DeepSeek-V3.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Copied from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/configuration_deepseek.py

    Args:
        vocab_size (`int`, *optional*, defaults to 129280):
            Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`DeepseekV3Model`]
        hidden_size (`int`, *optional*, defaults to 7168):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 18432):
            Dimension of the MLP representations.
        moe_intermediate_size (`int`, *optional*, defaults to 2048):
            Dimension of the MoE representations.
        num_hidden_layers (`int`, *optional*, defaults to 61):
            Number of hidden layers in the Transformer decoder.
        num_nextn_predict_layers (`int`, *optional*, defaults to 1):
            Number of nextn predict layers in the DeepSeekV3 Model.
        num_attention_heads (`int`, *optional*, defaults to 128):
            Number of attention heads for each attention layer in the Transformer decoder.
        n_shared_experts (`int`, *optional*, defaults to 1):
            Number of shared experts, None means dense model.
        n_routed_experts (`int`, *optional*, defaults to 256):
            Number of routed experts, None means dense model.
        routed_scaling_factor (`float`, *optional*, defaults to 2.5):
            Scaling factor for routed experts.
        topk_method (`str`, *optional*, defaults to `"noaux_tc"`):
            Topk method used in routed gate.
        n_group (`int`, *optional*, defaults to 8):
            Number of groups for routed experts.
        topk_group (`int`, *optional*, defaults to 4):
            Number of selected groups for each token (for each token, ensuring the selected experts are only within `topk_group` groups).
        num_experts_per_tok (`int`, *optional*, defaults to 8):
            Number of selected experts, None means dense model.
        moe_layer_freq (`int`, *optional*, defaults to 1):
            The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
        first_k_dense_replace (`int`, *optional*, defaults to 3):
            Number of dense layers in shallow layers (embed->dense->dense->...->dense->moe->moe...->lm_head).
                                                             \--k dense layers--/
        norm_topk_prob (`bool`, *optional*, defaults to `True`):
            Whether to normalize the weights of the routed experts.
        scoring_func (`str`, *optional*, defaults to `"sigmoid"`):
            Method of computing expert weights.
        aux_loss_alpha (`float`, *optional*, defaults to 0.001):
            Auxiliary loss weight coefficient.
        seq_aux (`bool`, *optional*, defaults to `True`):
            Whether to compute the auxiliary loss for each individual sample.
        num_key_value_heads (`int`, *optional*, defaults to 128):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If set to `None`, it defaults to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 0):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 1):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.

    ```python
    >>> from transformers import DeepseekV3Model, DeepseekV3Config

    >>> # Initializing a Deepseek-V3 style configuration
    >>> configuration = DeepseekV3Config()

    >>> # Accessing a configuration value
    >>> configuration.vocab_size
    ```"""

    model_type = "deepseek_v3"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=129280,
        hidden_size=7168,
        intermediate_size=18432,
        moe_intermediate_size=2048,
        num_hidden_layers=61,
        num_nextn_predict_layers=1,
        num_attention_heads=128,
        num_key_value_heads=128,
        n_shared_experts=1,
        n_routed_experts=256,
        ep_size=1,
        routed_scaling_factor=2.5,
        kv_lora_rank=512,
        q_lora_rank=1536,
        qk_rope_head_dim=64,
        v_head_dim=128,
        qk_nope_head_dim=128,
        topk_method="noaux_tc",
        n_group=8,
        topk_group=4,
        num_experts_per_tok=8,
        moe_layer_freq=1,
        first_k_dense_replace=3,
        norm_topk_prob=True,
        scoring_func="sigmoid",
        aux_loss_alpha=0.001,
        seq_aux=True,
        hidden_act="silu",
        max_position_embeddings=4096,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.moe_intermediate_size = moe_intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_nextn_predict_layers = num_nextn_predict_layers
        self.num_attention_heads = num_attention_heads
        self.n_shared_experts = n_shared_experts
        self.n_routed_experts = n_routed_experts
        self.ep_size = ep_size
        self.routed_scaling_factor = routed_scaling_factor
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        self.topk_method = topk_method
        self.n_group = n_group
        self.topk_group = topk_group
        self.num_experts_per_tok = num_experts_per_tok
        self.moe_layer_freq = moe_layer_freq
        self.first_k_dense_replace = first_k_dense_replace
        self.norm_topk_prob = norm_topk_prob
        self.scoring_func = scoring_func
        self.aux_loss_alpha = aux_loss_alpha
        self.seq_aux = seq_aux
        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )


class MoonViTConfig(PretrainedConfig):
    model_type = "moonvit"

    def __init__(
        self,
        patch_size: int = 14,
        init_pos_emb_height: int = 64,
        init_pos_emb_width: int = 64,
        num_attention_heads: int = 16,
        num_hidden_layers: int = 27,
        hidden_size: int = 1152,
        intermediate_size: int = 4304,
        merge_kernel_size: tuple[int, int] = (2, 2),
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        # Positional embedding config
        self.init_pos_emb_height = init_pos_emb_height
        self.init_pos_emb_width = init_pos_emb_width
        # Transformer config
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        # Patch merger config
        self.merge_kernel_size = merge_kernel_size
class KimiVLConfig(PretrainedConfig):
model_type = "kimi_vl"
def __init__(
self,
vision_config: Optional[Union[dict, MoonViTConfig]] = None,
text_config: Optional[Union[dict, DeepseekV3Config]] = None,
ignore_index: int = -100,
media_placeholder_token_id: int = 163605,
pad_token_id: int = 0,
**kwargs,
):
if vision_config is None:
vision_config = MoonViTConfig()
elif isinstance(vision_config, dict):
vision_config = MoonViTConfig(**vision_config)
self.vision_config = vision_config
if text_config is None:
text_config = DeepseekV3Config()
elif isinstance(text_config, dict):
text_config = DeepseekV3Config(**text_config)
self.text_config = text_config
self.ignore_index = ignore_index
self.media_placeholder_token_id = media_placeholder_token_id
attn_implementation = kwargs.get("attn_implementation")
if attn_implementation is not None:
if attn_implementation in ["eager", "flash_attention_2"]:
self._attn_implementation = attn_implementation
self.vision_config._attn_implementation = attn_implementation
self.text_config._attn_implementation = attn_implementation
else:
raise ValueError(
f"Invalid attention implementation: {attn_implementation}"
)
super().__init__(pad_token_id=pad_token_id, **kwargs)
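
Note on the sub-config handling in `KimiVLConfig.__init__` above: each sub-config may arrive as `None`, a plain dict, or an already-built config object, and an optional `attn_implementation` is validated and copied onto both sub-configs. The following is a minimal, dependency-free sketch of that pattern. `VisionCfg`, `TextCfg`, `coerce`, and `build_config` are hypothetical stand-ins introduced for illustration; they are not part of this repository or of `transformers`.

```python
from dataclasses import dataclass
from typing import Optional

# Same whitelist that KimiVLConfig checks against.
ALLOWED_ATTN = ("eager", "flash_attention_2")


@dataclass
class VisionCfg:  # hypothetical stand-in for MoonViTConfig
    patch_size: int = 14
    hidden_size: int = 1152
    attn_implementation: Optional[str] = None


@dataclass
class TextCfg:  # hypothetical stand-in for DeepseekV3Config
    max_position_embeddings: int = 4096
    attn_implementation: Optional[str] = None


def coerce(value, cls):
    """Turn None or a dict into an instance of cls; pass instances through."""
    if value is None:
        return cls()
    if isinstance(value, dict):
        return cls(**value)
    return value


def build_config(vision=None, text=None, attn_implementation=None):
    vision_cfg = coerce(vision, VisionCfg)
    text_cfg = coerce(text, TextCfg)
    if attn_implementation is not None:
        if attn_implementation not in ALLOWED_ATTN:
            raise ValueError(
                f"Invalid attention implementation: {attn_implementation}"
            )
        # Propagate the choice to both sub-configs, as the real class does.
        vision_cfg.attn_implementation = attn_implementation
        text_cfg.attn_implementation = attn_implementation
    return vision_cfg, text_cfg


vision_cfg, text_cfg = build_config({"patch_size": 16}, None, "eager")
```

Accepting dicts is what allows a nested `config.json` to be loaded back into typed sub-config objects.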

BIN
figures/arch.png Normal file

Binary file not shown. Size: 626 KiB

BIN
figures/demo1.png Normal file

Binary file not shown. Size: 218 KiB

BIN
figures/demo2.png Normal file

Binary file not shown. Size: 258 KiB

BIN
figures/logo.png Normal file

Binary file not shown. Size: 1.6 KiB

BIN
figures/screenshot.png Normal file

Binary file not shown. Size: 808 KiB

BIN
figures/thinking_perf.png Normal file

Binary file not shown. Size: 222 KiB

9
generation_config.json Normal file
View File

@ -0,0 +1,9 @@
{
"bos_token_id": 163584,
"pad_token_id": 163838,
"eos_token_id": [
163585
],
"do_sample": true,
"temperature": 0.6
}
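
Note on `generation_config.json` above: with `"do_sample": true` and `"temperature": 0.6`, the logits are divided by the temperature before the softmax, which sharpens the distribution compared with a temperature of 1.0. The sketch below shows only that arithmetic in plain Python. The function name and the logits are made up for illustration, and this is not the sampling code used by the library.

```python
import math


def softmax_with_temperature(logits, temperature=0.6):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up values for illustration
p_default = softmax_with_temperature(logits, temperature=1.0)
p_config = softmax_with_temperature(logits, temperature=0.6)
```

With these inputs the most likely token gains probability mass at temperature 0.6, so sampling stays closer to the greedy choice while still allowing variation.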

126
image_processing_kimi_vl.py Normal file
View File

@ -0,0 +1,126 @@
"""Image processor class for KimiVL."""
import math
import numpy as np
from PIL import Image
from typing import Optional, Union
import torch
from torchvision.transforms import functional as TF
from transformers.image_utils import ImageInput, make_list_of_images, valid_images
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from transformers.utils import TensorType
OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)
class KimiVLImageProcessor(BaseImageProcessor):
model_type = "kimi_vl"
def __init__(
self,
patch_size: int = 14,
pad_input: bool = False,
image_mean: tuple[float, float, float] = OPENAI_DATASET_MEAN,
image_std: tuple[float, float, float] = OPENAI_DATASET_STD,
in_token_limit: int = 4096,
merge_kernel_size: list[int, int] = [2, 2],
**kwargs,
):
super().__init__(**kwargs)
self.in_token_limit = in_token_limit
self.patch_size = patch_size
self.pad_input = pad_input
self.image_mean = image_mean
self.image_std = image_std
self.merge_kernel_size = merge_kernel_size
def rescale(
self, image: Image.Image, merge_kernel_size: list[int, int] = [2, 2]
) -> Image.Image:
w, h = image.size
patch_size = self.patch_size
if (w // patch_size) * (h // patch_size) > self.in_token_limit:
scale = math.sqrt(self.in_token_limit / ((w // patch_size) * (h // patch_size)))
new_w, new_h = int(w * scale), int(h * scale)
image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
if self.pad_input:
new_w, new_h = image.size
pad_size_h = merge_kernel_size[0] * patch_size
pad_size_w = merge_kernel_size[1] * patch_size
pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w
image = TF.pad(image, (0, 0, pad_w, pad_h))
else:
new_w, new_h = image.size
new_w = new_w - new_w % patch_size
new_h = new_h - new_h % patch_size
image = TF.center_crop(image, (new_h, new_w))
w, h = image.size
if w // patch_size >= 512 or h // patch_size >= 512:
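raise ValueError(
f"Image grid {h // patch_size}x{w // patch_size} exceeds the positional embedding range; each side must be below 512 patches."
)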
return image
def to_tensor(self, image: Image.Image) -> torch.Tensor:
return TF.to_tensor(image.convert("RGB"))
def normalize(self, image: torch.Tensor) -> torch.Tensor:
return TF.normalize(image, self.image_mean, self.image_std)
def patchify(self, image: torch.Tensor) -> tuple[torch.Tensor, tuple[int, int]]:
patch_size = self.patch_size
C, H, W = image.shape
patches = image.reshape(C, H // patch_size, patch_size, W // patch_size, patch_size)
patches = patches.permute(1, 3, 0, 2, 4)
patches = patches.contiguous().view(-1, C, patch_size, patch_size)
grid_hw = (H // patch_size, W // patch_size)
return patches, grid_hw
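def _preprocess(self, image: ImageInput) -> tuple[torch.Tensor, tuple[int, int]]:
"""
Preprocess image and patchify it.
Args:
image (`ImageInput`):
Image to preprocess. In practice this must be a `PIL.Image.Image`, because `rescale` reads `image.size` and calls `image.resize`. The image is converted to RGB, scaled to the [0, 1] range by `to_tensor`, and then normalized with `image_mean` and `image_std`.
Returns:
patches: torch.Tensor of shape (num_patches, C, patch_size, patch_size)
grid_hw: tuple[int, int], the patch grid as (height, width)
"""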
image = self.rescale(image, self.merge_kernel_size)
image = self.to_tensor(image)
image = self.normalize(image)
patches, grid_hw = self.patchify(image)
return patches, grid_hw
def preprocess(
self,
images: ImageInput,
return_tensors: Optional[Union[str, TensorType]] = None,
) -> BatchFeature:
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
pixel_values, image_grid_hws = [], []
for image in images:
patches, image_grid_hw = self._preprocess(image)
pixel_values.append(patches)
image_grid_hws.append(image_grid_hw)
pixel_values = torch.concat(pixel_values, dim=0)
image_grid_hws = np.array(image_grid_hws)
data = {"pixel_values": pixel_values, "image_grid_hws": image_grid_hws}
return BatchFeature(data=data, tensor_type=return_tensors)
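
Note on `KimiVLImageProcessor.rescale` above: the size arithmetic can be followed without PIL or torch. The sketch below mirrors only the integer computations (downscale when the patch count exceeds `in_token_limit`, then either pad to a multiple of `merge_kernel_size * patch_size` or crop to a multiple of `patch_size`). `plan_resize` is a hypothetical helper name, and the image sizes are made up for illustration.

```python
import math


def plan_resize(w, h, patch_size=14, in_token_limit=4096,
                merge_kernel_size=(2, 2), pad_input=False):
    """Return ((new_w, new_h), (grid_h, grid_w)) following the rescale logic."""
    n_patches = (w // patch_size) * (h // patch_size)
    if n_patches > in_token_limit:
        # Shrink both sides by the same factor to approach the patch budget.
        scale = math.sqrt(in_token_limit / n_patches)
        w, h = int(w * scale), int(h * scale)
    if pad_input:
        # Pad right/bottom so each side is a multiple of merge * patch_size.
        pad_size_h = merge_kernel_size[0] * patch_size
        pad_size_w = merge_kernel_size[1] * patch_size
        h += (pad_size_h - h % pad_size_h) % pad_size_h
        w += (pad_size_w - w % pad_size_w) % pad_size_w
    else:
        # Crop so each side is a multiple of patch_size.
        w -= w % patch_size
        h -= h % patch_size
    return (w, h), (h // patch_size, w // patch_size)


# Class defaults: no padding, crop to a multiple of 14.
cropped = plan_resize(1000, 800)
# Values from preprocessor_config.json: larger budget and padding enabled.
padded = plan_resize(1000, 800, in_token_limit=16384, pad_input=True)
```

For a 1000x800 image the cropped variant gives a 57x71 patch grid, while the padded variant gives 58x72, which is divisible by the 2x2 merge kernel in both directions.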

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0fcc575e77d59bbf4504439266d770827a3018b56e2eecf4ab5a218c944d6326
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b830e3948efd3257f27de6b2960c64a6704dafed27ebf6f53b9e1af3a859fe2
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:729e85635b6f8c04a5a50e1330b6d190651f9a5901a34d3ee7caa0d20dbc6bbb
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2339e10a3d58cb9908660eb5b76ae75b3c0c091a773575083d4920a905ea7a45
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0dd4dd6c7179e98fb69402f7d4719c175ffcfa5f5cdea208fb7be9e2f6204230
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:db8f22f385d5743cabc235eff38ac77ab7c77023f906e8c56b074b56cda67bb7
size 135

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a4aac88787ee97d9a34627d0c03fc7883b98253a239fd93a5cc82a9c8f0d8a7
size 135

5686
model.safetensors.index.json Normal file

File diff suppressed because it is too large Load Diff

2674
modeling_kimi_vl.py Normal file

File diff suppressed because it is too large Load Diff

20
preprocessor_config.json Normal file
View File

@ -0,0 +1,20 @@
{
"auto_map": {
"AutoImageProcessor": "image_processing_kimi_vl.KimiVLImageProcessor",
"AutoProcessor": "processing_kimi_vl.KimiVLProcessor"
},
"in_token_limit": 16384,
"patch_size": 14,
"num_pooled_tokens": 1024,
"image_mean": [
0.5,
0.5,
0.5
],
"image_std": [
0.5,
0.5,
0.5
],
"pad_input": true
}
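
Note on `preprocessor_config.json` above: assuming these values are passed to the `KimiVLImageProcessor` constructor in the usual way, they replace the class defaults (`in_token_limit` 16384 instead of 4096, `pad_input` true instead of false, and mean/std of 0.5 instead of the OpenAI CLIP statistics). With a mean and standard deviation of 0.5, normalization maps 8-bit pixel values into the range [-1, 1]. A one-channel sketch of that arithmetic, with a hypothetical helper name:

```python
def normalize_pixel(value_0_255, mean=0.5, std=0.5):
    """Scale an 8-bit value to [0, 1], then apply (x - mean) / std."""
    x = value_0_255 / 255.0
    return (x - mean) / std


darkest = normalize_pixel(0)
brightest = normalize_pixel(255)
```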

170
processing_kimi_vl.py Normal file
View File

@ -0,0 +1,170 @@
# coding=utf-8
# Copyright 2025 The Moonshot Team and HuggingFace Inc. team. All rights reserved.
#
# The code is based on the Qwen2VL processor (qwen2_vl/processing_qwen2_vl.py), but modified for KimiVL.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for KimiVL.
"""
from typing import List, Union
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, _validate_images_text_input_order
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging
logger = logging.get_logger(__name__)
class KimiVLProcessorKwargs(ProcessingKwargs, total=False):
_defaults = {
"text_kwargs": {
"padding": False,
},
"images_kwargs": {},
}
class KimiVLProcessor(ProcessorMixin):
r"""
Constructs a KimiVL processor which wraps a KimiVL image processor and a tokenizer into a single processor.
[`KimiVLProcessor`] offers all the functionalities of [`KimiVLImageProcessor`] and [`TikTokenTokenizer`]. See the
[`~KimiVLProcessor.__call__`] and [`~KimiVLProcessor.decode`] for more information.
Args:
image_processor ([`KimiVLImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`TikTokenTokenizer`], *optional*):
The tokenizer is a required input.
chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor=None,
tokenizer=None,
chat_template=None,
**kwargs,
):
self.image_token = "<|media_pad|>"
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
images: ImageInput = None,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
**kwargs: Unpack[KimiVLProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to TikTokenTokenizer's [`~TikTokenTokenizer.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
KimiVLImageProcessor's [`~KimiVLImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
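- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **image_grid_hws** -- Patch grid size (height, width) of each image. Returned when `images` is not `None`.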
"""
if images is None and text is None:
raise ValueError("You have to specify at least one of `images` or `text`.")
# check if images and text inputs are reversed for BC
images, text = _validate_images_text_input_order(images, text)
output_kwargs = self._merge_kwargs(
KimiVLProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
image_grid_hws = image_inputs["image_grid_hws"]
else:
image_inputs = {}
image_grid_hws = None
if isinstance(text, str):
text = [text]
elif not isinstance(text, list):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
if image_grid_hws is not None:
merge_length = self.image_processor.merge_kernel_size[0] * self.image_processor.merge_kernel_size[1]
index = 0
for i in range(len(text)):
while self.image_token in text[i]:
text[i] = text[i].replace(
self.image_token,
"<|placeholder|>" * (image_grid_hws[index].prod() // merge_length),
1,
)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
return BatchFeature(data={**text_inputs, **image_inputs})
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to TikTokenTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to TikTokenTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
__all__ = ["KimiVLProcessor", "KimiVLProcessorKwargs"]
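
Note on the placeholder expansion in `KimiVLProcessor.__call__` above: each `<|media_pad|>` in the text is replaced by as many copies as the image contributes after patch merging, that is `grid_h * grid_w // (merge_h * merge_w)`. The sketch below reproduces that loop on plain strings. `expand_image_tokens` is a hypothetical helper, and unlike the original it returns a new list instead of modifying the input list in place.

```python
IMAGE_TOKEN = "<|media_pad|>"
TEMP_TOKEN = "<|placeholder|>"


def expand_image_tokens(texts, grid_hws, merge_kernel_size=(2, 2)):
    """Expand every image token to one token per merged patch."""
    merge_length = merge_kernel_size[0] * merge_kernel_size[1]
    expanded = []
    index = 0  # running index over all images in the batch
    for text in texts:
        while IMAGE_TOKEN in text:
            grid_h, grid_w = grid_hws[index]
            count = (grid_h * grid_w) // merge_length
            # Use a temporary token so the loop does not re-match its own output.
            text = text.replace(IMAGE_TOKEN, TEMP_TOKEN * count, 1)
            index += 1
        expanded.append(text.replace(TEMP_TOKEN, IMAGE_TOKEN))
    return expanded


result = expand_image_tokens(
    ["a<|media_pad|>b<|media_pad|>"],
    [(4, 6), (2, 2)],  # made-up patch grids: 24 and 4 patches
)
```

The first image yields 6 tokens (24 patches divided by 4) and the second yields 1. The temporary `<|placeholder|>` string is what keeps the `while` loop from expanding tokens it has already inserted.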

3
tiktoken.model Normal file
View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e065f594aa8ac9c7c583e31626a98bc910072b59f7c24fe9f5980653bdf936ae
size 132

302
tokenization_moonshot.py Normal file
View File

@ -0,0 +1,302 @@
import os
import tiktoken
from logging import getLogger
from pathlib import Path
from typing import (
cast,
Tuple,
Dict,
Iterator,
List,
Union,
Optional,
)
from shutil import copyfile
from tiktoken.load import load_tiktoken_bpe
from tokenizers import AddedToken
from transformers.tokenization_utils import PreTrainedTokenizer
from transformers.utils import to_py_obj
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
logger = getLogger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "tiktoken.model"}
SPIECE_UNDERLINE = "▁"
class TikTokenTokenizer(PreTrainedTokenizer):
"""
Tokenizing and encoding/decoding text using the Tiktoken tokenizer. See megatron/tokenizer/tiktoken_tokenizer.py.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
The path to the Tiktoken model file.
bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[BOS]"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[EOS]"`):
The end of sequence token.
unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead. The second to last item in special_tokens.
pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
additional_special_tokens (list of `str`, *optional*):
A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
skipped when decoding if `skip_special_tokens` is set to `True`.
"""
vocab_files_names = VOCAB_FILES_NAMES
model_input_names = ["input_ids", "attention_mask"]
special_tokens: Dict[str, int]
num_reserved_special_tokens = 256
pat_str = "|".join(
[
r"""[\p{Han}]+""",
r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
r"""\p{N}{1,3}""",
r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
r"""\s*[\r\n]+""",
r"""\s+(?!\S)""",
r"""\s+""",
]
)
def __init__(
self,
vocab_file,
bos_token: Union[str, AddedToken] = "[BOS]",
eos_token: Union[str, AddedToken] = "[EOS]",
unk_token: Union[str, AddedToken] = "[UNK]",
pad_token: Union[str, AddedToken] = "[PAD]",
additional_special_tokens: Optional[List[str]] = None,
added_tokens_decoder: Optional[dict] = None,
**kwargs,
):
assert os.path.isfile(vocab_file), vocab_file
if additional_special_tokens is None:
additional_special_tokens = [
"<|im_end|>",
"<|im_middle|>",
"<|im_user|>",
"<|im_assistant|>",
"<|im_system|>",
]
special_tokens_mapping = {
i: added_tokens_decoder[i].content for i in added_tokens_decoder
}
self.vocab_file = vocab_file
mergeable_ranks = load_tiktoken_bpe(vocab_file)
num_base_tokens = len(mergeable_ranks)
self.special_tokens = {
special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
for i in range(
num_base_tokens, num_base_tokens + self.num_reserved_special_tokens + 2
)
}
self.model = tiktoken.Encoding(
name=Path(vocab_file).name,
pat_str=self.pat_str,
mergeable_ranks=mergeable_ranks,
special_tokens=self.special_tokens,
)
self.n_words: int = self.model.n_vocab
# BOS / EOS token IDs
self.bos_id: int = self.special_tokens[str(bos_token)]
self.eos_id: int = self.special_tokens[str(eos_token)]
self.pad_id: int = self.special_tokens[str(pad_token)]
self.unk_id: int = self.special_tokens[str(unk_token)]
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
self.decoder = {}
for i in range(self.n_words):
# Taken from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
decoding = "".join(
[
self.byte_encoder[ord(char)]
for char in self.model.decode_single_token_bytes(i).decode(
"latin-1"
)
]
)
self.decoder[i] = decoding
self.encoder = {}
for i in range(self.n_words):
if i in self.decoder:
self.encoder[self.decoder[i]] = i
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
self.all_special_ids_set = set(self.all_special_ids)
def encode(
self, text: str, allow_special_tokens: bool = True, **kwargs
) -> List[int]:
"""
Encodes a string into a list of token IDs.
Args:
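text (str): The input string to be encoded.
allow_special_tokens (bool): If True, special tokens found in the text are encoded as their special token IDs. If False, they are encoded as ordinary text.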
Returns:
list[int]: A list of token IDs.
"""
# If there are other args, we should call super().encode because there is a lot of code
# to handle those args. super().encode will finally call _tokenize and _convert_token_to_id.
if len(kwargs) > 0:
return super().encode(text, **kwargs)
assert type(text) is str
# The tiktoken tokenizer can handle <=400k chars without
# pyo3_runtime.PanicException.
TIKTOKEN_MAX_ENCODE_CHARS = 400_000
# https://github.com/openai/tiktoken/issues/195
# Here we iterate over subsequences and split if we exceed the limit
# of max consecutive non-whitespace or whitespace characters.
MAX_NO_WHITESPACES_CHARS = 25_000
substrs = (
substr
for i in range(0, len(text), TIKTOKEN_MAX_ENCODE_CHARS)
for substr in self._split_whitespaces_or_nonwhitespaces(
text[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
)
)
t: List[int] = []
for substr in substrs:
if allow_special_tokens:
t.extend(
# special tokens in the text are encoded as their dedicated token IDs
self.model.encode(
substr,
allowed_special="all",
)
)
else:
t.extend(
# we should consider special token as a common token
self.model.encode(
substr,
disallowed_special=(),
)
)
return t
def decode(self, token_ids: Union[int, List[int]], **kwargs) -> str:
"""
Decodes a list of token IDs into a string.
Args:
token_ids (Union[int, List[int]]): A token ID or a list of token IDs to be decoded.
Returns:
str: The decoded string.
"""
# If there are other args, we should call super().decode because there is a lot of code
# to handle those args. super().decode will finally call convert_tokens_to_string and _convert_id_to_token.
if len(kwargs) > 0:
return super().decode(token_ids, **kwargs)
token_ids = to_py_obj(token_ids)
if type(token_ids) is int:
token_ids = [token_ids]
return self.model.decode(cast(List[int], token_ids))
@staticmethod
def _split_whitespaces_or_nonwhitespaces(
s: str, max_consecutive_slice_len: int
) -> Iterator[str]:
"""
Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
consecutive whitespaces or consecutive non-whitespaces.
"""
current_slice_len = 0
current_slice_is_space = s[0].isspace() if len(s) > 0 else False
slice_start = 0
for i in range(len(s)):
is_now_space = s[i].isspace()
if current_slice_is_space ^ is_now_space:
current_slice_len = 1
current_slice_is_space = is_now_space
else:
current_slice_len += 1
if current_slice_len > max_consecutive_slice_len:
yield s[slice_start:i]
slice_start = i
current_slice_len = 1
yield s[slice_start:]
""" ----- Below are the abstract methods required by PreTrainedTokenizer ----- """
@property
def vocab_size(self) -> int:
return self.n_words
def get_vocab(self) -> Dict[str, int]:
return self.encoder
def _tokenize(self, text: str, **kwargs) -> List[str]:
return [self.decoder[t] for t in self.encode(text)]
def _convert_token_to_id(self, token: str) -> int:
return self.encoder.get(token, self.unk_id)
def _convert_id_to_token(self, index: int) -> str:
return self.decoder.get(index)
@staticmethod
def clean_up_tokenization(out_string: str) -> str:
return out_string
def convert_tokens_to_string(self, tokens: List[str]) -> str:
text = "".join(tokens).replace(SPIECE_UNDERLINE, "")
text = bytearray([self.byte_decoder[c] for c in text]).decode(
"utf-8", "replace"
)
return text
def save_vocabulary(
self, save_directory: str, filename_prefix: Optional[str] = None
) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
out_vocab_file = os.path.join(
save_directory,
(filename_prefix + "-" if filename_prefix else "")
+ VOCAB_FILES_NAMES["vocab_file"],
)
if os.path.abspath(self.vocab_file) != os.path.abspath(
out_vocab_file
) and os.path.isfile(self.vocab_file):
copyfile(self.vocab_file, out_vocab_file)
return (out_vocab_file,)
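
Note on `_split_whitespaces_or_nonwhitespaces` above: the helper cuts a string so that no piece contains a run of more than `max_consecutive_slice_len` consecutive whitespace characters or consecutive non-whitespace characters, which is how `encode` works around the tiktoken limitation referenced in its comments. The sketch below is a standalone copy of the same logic under a hypothetical name, with a small made-up input so the cuts are easy to follow.

```python
from typing import Iterator


def split_runs(s: str, max_run: int) -> Iterator[str]:
    """Yield slices of s in which no same-kind character run exceeds max_run."""
    run_len = 0
    run_is_space = s[0].isspace() if len(s) > 0 else False
    slice_start = 0
    for i, ch in enumerate(s):
        is_space = ch.isspace()
        if run_is_space ^ is_space:
            # The kind of character changed, so a new run starts here.
            run_len = 1
            run_is_space = is_space
        else:
            run_len += 1
            if run_len > max_run:
                # Cut before the character that would overflow the run.
                yield s[slice_start:i]
                slice_start = i
                run_len = 1
    yield s[slice_start:]


pieces = list(split_runs("aaaaa bb", 2))
```

Here the run of five `a` characters is cut after every second character, and the last piece carries the rest of the string. Concatenating the pieces always reproduces the input, so token boundaries change only at the cut points.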

134
tokenizer_config.json Normal file
View File

@ -0,0 +1,134 @@
{
"added_tokens_decoder": {
"163584": {
"content": "[BOS]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163585": {
"content": "[EOS]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163586": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163601": {
"content": "<|im_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163587": {
"content": "<|im_user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163588": {
"content": "<|im_assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163594": {
"content": "<|im_system|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163602": {
"content": "<|media_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163603": {
"content": "<|media_content|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163604": {
"content": "<|media_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163605": {
"content": "<|media_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163838": {
"content": "[PAD]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"163839": {
"content": "[UNK]",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_end|>",
"<|im_user|>",
"<|im_assistant|>",
"<|im_system|>",
"<|im_middle|>",
"<|media_start|>",
"<|media_content|>",
"<|media_end|>",
"<|media_pad|>"
],
"bos_token": "[BOS]",
"clean_up_tokenization_spaces": false,
"eos_token": "[EOS]",
"extra_special_tokens": {},
"model_max_length": 1048576,
"pad_token": "[PAD]",
"unk_token": "[UNK]",
"tokenizer_class": "TikTokenTokenizer",
"chat_template": "{%- for message in messages -%}{%- if loop.first and messages[0]['role'] != 'system' -%}{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}{%- endif -%}{%- if message['role'] == 'system' -%}{{'<|im_system|>'}}{%- endif -%}{%- if message['role'] == 'user' -%}{{'<|im_user|>'}}{%- endif -%}{%- if message['role'] == 'assistant' -%}{{'<|im_assistant|>'}}{%- endif -%}{{- message['role'] -}}{{'<|im_middle|>'}}{%- if message['content'] is string -%}{{- message['content'] + '<|im_end|>' -}}{%- else -%}{%- for content in message['content'] -%}{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}{{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}{%- else -%}{{content['text']}}{%- endif -%}{%- endfor -%}{{'<|im_end|>'}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{'<|im_assistant|>assistant<|im_middle|>'}}{%- endif -%}",
"auto_map": {
"AutoTokenizer": [
"tokenization_moonshot.TikTokenTokenizer",
null
]
}
}
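
Note on the `chat_template` in `tokenizer_config.json` above: the Jinja template is dense, so the sketch below restates its behaviour as a plain Python function. It is an illustrative re-implementation under the hypothetical name `render_chat`, written by reading the template, and it is not how `transformers` applies templates. The message contents are made up.

```python
ROLE_TOKENS = {
    "system": "<|im_system|>",
    "user": "<|im_user|>",
    "assistant": "<|im_assistant|>",
}
DEFAULT_SYSTEM = (
    "<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>"
)
IMAGE_BLOCK = "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"


def render_chat(messages, add_generation_prompt=False):
    """Render messages the way the chat template lays them out."""
    out = []
    for position, message in enumerate(messages):
        # A default system turn is injected when the chat has none.
        if position == 0 and messages[0]["role"] != "system":
            out.append(DEFAULT_SYSTEM)
        role = message["role"]
        out.append(ROLE_TOKENS.get(role, ""))
        out.append(role + "<|im_middle|>")
        content = message["content"]
        if isinstance(content, str):
            out.append(content)
        else:
            for part in content:
                is_image = (
                    part.get("type") == "image"
                    or "image" in part
                    or "image_url" in part
                )
                out.append(IMAGE_BLOCK if is_image else part["text"])
        out.append("<|im_end|>")
    if add_generation_prompt:
        out.append("<|im_assistant|>assistant<|im_middle|>")
    return "".join(out)


prompt = render_chat(
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "example.png"},
                {"type": "text", "text": "Hi"},
            ],
        }
    ],
    add_generation_prompt=True,
)
```

Each image part contributes a single `<|media_pad|>` at this stage. The processor later expands it to one token per merged patch.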