Upload folder using ModelScope SDK

2026-07-16 10:42:54 +08:00 · 2025-06-21 18:08:30 +00:00
parent 7e2c993c84
commit e41c819c17
28 changed files with 9914 additions and 40 deletions
--- a/.ipynb_checkpoints/README-checkpoint.md
+++ b/.ipynb_checkpoints/README-checkpoint.md
@ -0,0 +1,165 @@
 ---
 base_model:
 - moonshotai/Kimi-VL-A3B-Instruct
 license: mit
 pipeline_tag: image-text-to-text
 library_name: transformers
 ---
 > [!Note]
 > This is an improved version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking). Please consider using this updated model instead of the previous version.
 <div align="center">
  <img width="30%" src="figures/logo.png">
 </div>
 <div align="center">
  <a href="https://arxiv.org/abs/2504.07491">
    <b>📄 Tech Report</b>
  </a> &nbsp;|&nbsp;
  <a href="https://github.com/MoonshotAI/Kimi-VL">
    <b>📄 GitHub</b>
  </a> &nbsp;|&nbsp;
  <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking-2506/">💬 Chat Web</a>
 </div>
 ## 1. Introduction
 Two months after releasing our first open-source multimodal reasoning model, [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking), we are releasing an improved version, [Kimi-VL-A3B-Thinking-2506](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506). Compared to the previous version, the 2506 version provides several new or improved abilities:
 - **It Thinks Smarter while Consuming Fewer Tokens**: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.2 on MMMU-Pro (+3.2), 64.0 on MMMU (+2.1), while requiring on average a 20\% shorter thinking length.
 - **It Sees Clearer with Thinking**: Unlike the previous version, which specializes in thinking tasks, the 2506 version also achieves the same or better performance on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4), matching or surpassing our non-thinking model ([Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
 - **It Extends to Video Scenarios**: The new 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2), while retaining good general video understanding (71.9 on Video-MME, matching [Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
 - **It Extends to Higher Resolution**: The new 2506 version supports 3.2 million total pixels in a single image, 4x that of the previous version. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
 In this blog, we provide simple examples of using this new model in image reasoning, video reasoning, and OS-agent scenarios.
 ## 2. Performance
 ## 3. Usage
 ### 3.1. Inference with VLLM (recommended)
 Because this is a long-decode model that can generate up to 32K tokens, we recommend using [VLLM](https://github.com/vllm-project/vllm/tree/main/vllm) for inference, which already supports the Kimi-VL series.
 ```shell
 MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation
 ```
 > [!Note]
 > It is important to explicitly install flash-attn to avoid CUDA out-of-memory errors.
 ```python
 from transformers import AutoProcessor
 from vllm import LLM, SamplingParams
 model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
 llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256}
 )
 processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
 sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)
 import requests
 from PIL import Image
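 def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Always return a (thinking, summary) pair so callers can unpack the result.
    if bot in text and eot not in text:
        # Thinking was opened but never closed (e.g. generation was truncated).
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting from the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text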
 OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"
 url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
 image = Image.open(requests.get(url,stream=True).raw)
 messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
 ]
 text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
 outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
 generated_text = outputs[0].outputs[0].text
 thinking, summary = extract_thinking_and_summary(generated_text)
 print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
 ```
 ### 3.2. Inference with 🤗 Hugging Face Transformers 
 This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
 ```python
 from PIL import Image
 from transformers import AutoModelForCausalLM, AutoProcessor
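 def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Always return a (thinking, summary) pair so callers can unpack the result.
    if bot in text and eot not in text:
        # Thinking was opened but never closed (e.g. generation was truncated).
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting from the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text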
 OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"
 url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
 model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
 model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
 )
 processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
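 import requests  # needed to fetch the demo image from its URL
 image_paths = [url]  # use the URL defined above, not the literal string "url"
 images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]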
 messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
 ]
 text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
 inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
 generated_ids = model.generate(**inputs, max_new_tokens=32768, temperature=0.8)
 generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 ]
 response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )[0]
 print(response)
 ```
 ## 4. Citation
 ```
@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report}, 
      author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491}, 
 }
 ```
--- a/README.md
+++ b/README.md
@ -1,47 +1,220 @@
 ---
-license: Apache License 2.0
+base_model:
-
+- moonshotai/Kimi-VL-A3B-Instruct
-#model-type:
+license: mit
-##e.g. gpt, phi, llama, chatglm, baichuan, etc.
+pipeline_tag: image-text-to-text
-#- gpt
+library_name: transformers
 #domain:
 ##e.g. nlp, cv, audio, multi-modal
 #- nlp
 #language:
 ##List of language codes https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
 #- cn 
 #metrics:
 ##e.g. CIDEr, BLEU, ROUGE, etc.
 #- CIDEr
 #tags:
 ##Any custom tags, including training methods such as pretrained, fine-tuned, instruction-tuned, RL-tuned, and others
 #- pretrained
 #tools:
 ##e.g. vllm, fastchat, llamacpp, AdaSeq, etc.
 #- vllm
 ---
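 ### The contributors of this model have not provided a more detailed model introduction. Model files and weights are available on the "Model Files" page.
 #### You can download the model with the following git clone command or with the ModelScope SDK
-SDK download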
+> [!Note]
-```bash
+> This is an improved version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking). Please consider using this updated model instead of the previous version.
-#Install ModelScope
+
-pip install modelscope
+<div align="center">
  <img width="80%" src="figures/logo.png">
 </div>
 <div align="center">
  <a href="https://arxiv.org/abs/2504.07491">
    <b>📄 Tech Report</b>
  </a> &nbsp;|&nbsp;
  <a href="https://github.com/MoonshotAI/Kimi-VL">
    <b>📄 GitHub</b>
  </a> &nbsp;|&nbsp;
  <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking-2506/">💬 Chat Web</a>
 </div>
 ## 1. Introduction
 This is an updated version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking), with the following improved abilities:
 - **It Thinks Smarter while Consuming Fewer Tokens**: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), 64.0 on MMMU (+2.1), while requiring on average a 20\% shorter thinking length.
 - **It Sees Clearer with Thinking**: Unlike the previous version, which specializes in thinking tasks, the 2506 version also achieves the same or better performance on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4), matching or surpassing our non-thinking model ([Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
 - **It Extends to Video Scenarios**: The new 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2), while retaining good general video understanding (71.9 on Video-MME, matching [Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
 - **It Extends to Higher Resolution**: The new 2506 version supports 3.2 million total pixels in a single image, 4x that of the previous version. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
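To get a feel for what 3.2 million pixels means for context length, the sketch below estimates how many visual tokens one image might occupy. It assumes that every 14x14 patch yields one vision feature and that each 2x2 group of patches is merged into a single token, which mirrors the `patch_size` and `merge_kernel_size` values in this repository's `config.json`. The rounding-up behaviour and the helper name `estimate_image_tokens` are illustrative assumptions, not the processor's actual implementation.

```python
import math

PATCH_SIZE = 14        # vision_config.patch_size in config.json
MERGE_KERNEL = (2, 2)  # vision_config.merge_kernel_size in config.json


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough visual-token count for one image (hypothetical helper).

    Assumes partial patches and partial merge groups are rounded up.
    """
    patches_w = math.ceil(width / PATCH_SIZE)
    patches_h = math.ceil(height / PATCH_SIZE)
    merged_w = math.ceil(patches_w / MERGE_KERNEL[1])
    merged_h = math.ceil(patches_h / MERGE_KERNEL[0])
    return merged_w * merged_h


if __name__ == "__main__":
    # A 1792x1792 image has roughly 3.2 million pixels.
    print(1792 * 1792, estimate_image_tokens(1792, 1792))
```

Under these assumptions a full-resolution image costs a few thousand tokens, which is worth keeping in mind when choosing `max_model_len` and the number of images per prompt.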
 ## 2. Performance
 Comparison with efficient models and two previous versions of Kimi-VL:
 <div align="center">
 | Benchmark (Metric)         | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
 |----------------------------|--------|---------------|---------------|----------------------|----------------------|--------------------------|
 | **General Multimodal**   |        |               |               |                      |                      |                          |
 | MMBench-EN-v1.1 (Acc)          | 83.1   | 83.2          | 74.6          | 82.9                 | 76.0                 | **84.4**                    |
 | RealWorldQA (Acc)          | 75.4   | 68.5          | 59.1          | 68.1                 | 64.0                 | **70.0**                     |
 | OCRBench (Acc)             | 815    | 864           | 702           | 864                  | 864                  | **869**                      |
 | MMStar (Acc)               | 64.7   | 63.0          | 56.1          | 61.7                 | 64.2                 | **70.4**                     |
 | MMVet (Acc)                | 69.1   | 67.1          | 64.9          | 66.7                 | 69.5                 | **78.1**                     |
 | **Reasoning**            |        |               |               |                      |                      |                          |
 | MMMU (val, Pass@1)         | 69.1   | 58.6          | 59.6          | 57.0                 | 61.7                 | **64.0**                     |
 | MMMU-Pro (Pass@1)          | 51.7   | 38.1          | 32.1          | 36.0                  | 43.2               | **46.3**                    |
 | **Math**                   |        |               |               |                      |                      |                          |
 | MATH-Vision (Pass@1)       | 30.4   | 25.0          | 32.1          | 21.7                | 36.8                | **56.9**                    |
 | MathVista_MINI (Pass@1)    | 63.8   | 68.0          | 56.1          | 68.6                | 71.7                | **80.1**                    |
 | **Video**                  |        |               |               |                      |                      |                          |
 | VideoMMMU (Pass@1)         | 61.2   | 47.4          | 57.0          | 52.1                | 55.5                | **65.2**                    |
 | MMVU (Pass@1)              | 67.4   | 50.1          | 57.0          | 52.7                | 53.0                | **57.5**                    |
 | Video-MME (w/ sub.)        | 77.2   | 71.6          | 62.1          | **72.7**                | 66.0                    | 71.9                    |
 | **Agent Grounding**        |        |               |               |                      |                      |                          |
 | ScreenSpot-Pro (Acc)               | 0.8    | 29.0          | —             | 35.4                | —                    | **52.8**                    |
 | ScreenSpot-V2 (Acc)                | 18.1   | 84.2          | —             | **92.8**                | —                    | 91.4                    |
 | OSWorld-G (Acc)            | -     | 31.5           | —             | 41.6                | —                    | **52.5**                    |
 | **Long Document**          |        |               |               |                      |                      |                          |
 | MMLongBench-DOC (Acc)          | 42.8   | 29.6          | 21.3          | 35.1                | 32.5                | **42.1**                    |
 </div>
 Comparison with 30B-70B open-source models:
 <div align="center">
 | Benchmark (Metric)         | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Qwen2.5-VL-72B | Gemma3-27B-IT |
 |----------------------------|---------------------------|---------------|---------------|---------------|
 | **General Multimodal** | | | | |
 | MMBench-EN-v1.1 (Acc) | 84.4 | - | 88.3 | 78.9 |
 | RealWorldQA (Acc) | 70.0 | - | 75.7 | 62.5 |
 | OCRBench (Acc) | 869 | - | 885 |  753 |
 | MMStar (Acc) | 70.4 | 69.5 | 70.8 | 63.1 |
 | MMVet (Acc) | 78.1 | - | 74.0 | 71.0 |
 | **Reasoning** | | | ||
 | MMMU (val, Pass@1) | 64.0 | 70.0 | 70.2 | 64.9 |
 | MMMU-Pro (Pass@1) | 46.3 | 49.5 | 51.1 | - |
 | MATH-Vision (Pass@1) | 56.9 | 38.4 | 38.1 | 35.4 |
 | MathVista\_MINI (Pass@1) | 80.1 | 74.7 | 74.8 | 59.8 |
 | **Video** | | | | |
 | VideoMMMU (Pass@1) | 65.2 | - | 60.2 | 61.8 |
 | MMVU (Pass@1) | 57.5 | - | 62.9 | 61.3 |
 | Video-MME (w/ sub.) | 71.9 | 70.5/77.9 | 73.3/79.1 | - |
 | **Agent Grounding** | | | | |
 | ScreenSpot-Pro (Acc) | 52.8 | 39.4 | 43.6 | - |
 | ScreenSpot-V2 (Acc) | 91.4 | - | - | - |
 | OSWorld-G (Acc) | 52.5 | 46.5 | - | - |
 | **Long Document** | | | | |
 | MMLongBench-DOC (Acc) | 42.1 | - | 38.8 | - |
 </div>
 ## 3. Usage
 ### 3.1. Inference with VLLM (recommended)
 Because this is a long-decode model that can generate up to 32K tokens, we recommend using [VLLM](https://github.com/vllm-project/vllm/tree/main/vllm) for inference, which already supports the Kimi-VL series.
 ```shell
 MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation
 ```
 > [!Note]
 > It is important to explicitly install flash-attn to avoid CUDA out-of-memory errors.
 ```python
-#Download the model with the SDK
+from transformers import AutoProcessor
-from modelscope import snapshot_download
+from vllm import LLM, SamplingParams
-model_dir = snapshot_download('moonshotai/Kimi-VL-A3B-Thinking-2506')
+
-```
+model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
-Git download
+llm = LLM(
-```
+    model_path,
-#Download the model with Git
+    trust_remote_code=True,
-git clone https://www.modelscope.cn/moonshotai/Kimi-VL-A3B-Thinking-2506.git
+    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256}
 )
 processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
 sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)
 import requests
 from PIL import Image
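 def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Always return a (thinking, summary) pair so callers can unpack the result.
    if bot in text and eot not in text:
        # Thinking was opened but never closed (e.g. generation was truncated).
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting from the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text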
 OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"
 url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
 image = Image.open(requests.get(url,stream=True).raw)
 messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
 ]
 text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
 outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
 generated_text = outputs[0].outputs[0].text
 thinking, summary = extract_thinking_and_summary(generated_text)
 print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
 ```
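The thinking/summary split used above can be tried offline, without loading the model. The following is a minimal sketch of the same idea; the helper name `split_thinking` and the sample strings are hypothetical, and unlike the README helper this variant keeps the partial thinking text when the closing marker is missing.

```python
from typing import Tuple

BOT = "◁think▷"   # marker that opens the thinking section
EOT = "◁/think▷"  # marker that closes it


def split_thinking(text: str, bot: str = BOT, eot: str = EOT) -> Tuple[str, str]:
    """Return (thinking, summary) for a raw generation (hypothetical helper)."""
    if eot in text:
        head, _, summary = text.partition(eot)
        # Drop the opening marker if it is present in the head.
        thinking = head.split(bot, 1)[-1]
        return thinking.strip(), summary.strip()
    if bot in text:
        # Generation stopped before the closing marker: no summary yet.
        return text.split(bot, 1)[-1].strip(), ""
    # No markers at all: treat the whole text as the summary.
    return "", text.strip()


if __name__ == "__main__":
    sample = "◁think▷The ears are folded forward.◁/think▷Scottish"
    thinking, summary = split_thinking(sample)
    print("thinking:", thinking)
    print("summary:", summary)
```

Returning a pair in every branch means the caller can always unpack the result, including when `max_tokens` cuts the generation off mid-thought.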
-<p style="color: lightgrey;">If you are a contributor to this model, we invite you to complete the model card promptly by following the <a href="https://modelscope.cn/docs/ModelScope%E6%A8%A1%E5%9E%8B%E6%8E%A5%E5%85%A5%E6%B5%81%E7%A8%8B%E6%A6%82%E8%A7%88" style="color: lightgrey; text-decoration: underline;">model contribution documentation</a>.</p>
+
 ### 3.2. Inference with 🤗 Hugging Face Transformers 
 This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
 ```python
 from PIL import Image
 from transformers import AutoModelForCausalLM, AutoProcessor
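 def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    # Always return a (thinking, summary) pair so callers can unpack the result.
    if bot in text and eot not in text:
        # Thinking was opened but never closed (e.g. generation was truncated).
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by starting from the beginning of the text.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text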
 OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"
 url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
 model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
 model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
 )
 processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
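 import requests  # needed to fetch the demo image from its URL
 image_paths = [url]  # use the URL defined above, not the literal string "url"
 images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]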
 messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
 ]
 text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
 inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
 generated_ids = model.generate(**inputs, max_new_tokens=32768, temperature=0.8)
 generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 ]
 response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )[0]
 print(response)
 ```
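To see what `apply_chat_template` hands to the model, the prompt layout defined in this repository's `chat_template.jinja` can be mirrored in plain Python. This is only a sketch for inspection: the function name `render_prompt` is hypothetical, and it assumes the template's whitespace control removes all whitespace between tags. The special tokens are copied from the template.

```python
from typing import Dict, List, Union

Message = Dict[str, Union[str, List[Dict[str, str]]]]

ROLE_TOKENS = {
    "system": "<|im_system|>",
    "user": "<|im_user|>",
    "assistant": "<|im_assistant|>",
}
IMAGE_PLACEHOLDER = "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"
DEFAULT_SYSTEM = "<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>"


def render_prompt(messages: List[Message], add_generation_prompt: bool = True) -> str:
    """Plain-Python mirror of chat_template.jinja (hypothetical helper)."""
    parts: List[str] = []
    # A default system turn is injected when the conversation has none.
    if messages and messages[0]["role"] != "system":
        parts.append(DEFAULT_SYSTEM)
    for message in messages:
        role = message["role"]
        parts.append(ROLE_TOKENS.get(role, "") + role + "<|im_middle|>")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                # Image items become a fixed placeholder; the pixels are passed separately.
                if item.get("type") == "image" or "image" in item or "image_url" in item:
                    parts.append(IMAGE_PLACEHOLDER)
                else:
                    parts.append(item["text"])
        parts.append("<|im_end|>")
    if add_generation_prompt:
        parts.append("<|im_assistant|>assistant<|im_middle|>")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": [
        {"type": "image", "image": ""},
        {"type": "text", "text": "What kind of cat is this? Answer with one word."},
    ]}]
    print(render_prompt(demo))
```

Each image item contributes one placeholder regardless of the value stored under `"image"`, which is why the vLLM example can pass an empty string there and supply the pixels through `multi_modal_data`.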
 ## 4. Citation
 ```
@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report}, 
      author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491}, 
 }
 ```
--- a/chat_template.jinja
+++ b/chat_template.jinja
@ -0,0 +1,31 @@
 {%- for message in messages -%}
  {%- if loop.first and messages[0]['role'] != 'system' -%}
    {{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}
  {%- endif -%}
  {%- if message['role'] == 'system' -%}
    {{'<|im_system|>'}}
  {%- endif -%}
  {%- if message['role'] == 'user' -%}
    {{'<|im_user|>'}}
  {%- endif -%}
  {%- if message['role'] == 'assistant' -%}
    {{'<|im_assistant|>'}}
  {%- endif -%}
  {{- message['role'] -}}
  {{'<|im_middle|>'}}
  {%- if message['content'] is string -%}
    {{- message['content'] + '<|im_end|>' -}}
  {%- else -%}
    {%- for content in message['content'] -%}
      {%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
        {{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}
      {%- else -%}
        {{content['text']}}
      {%- endif -%}
    {%- endfor -%}
    {{'<|im_end|>'}}
  {%- endif -%}
 {%- endfor -%}
 {%- if add_generation_prompt -%}
  {{'<|im_assistant|>assistant<|im_middle|>'}}
 {%- endif -%}
--- a/config.json
+++ b/config.json
@ -0,0 +1,75 @@
 {
  "architectures": [
    "KimiVLForConditionalGeneration"
  ],
  "auto_map": {
    "AutoConfig": "configuration_kimi_vl.KimiVLConfig",
    "AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
  },
  "vision_config": {
    "model_type": "moonvit",
    "patch_size": 14,
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "hidden_size": 1152,
    "intermediate_size": 4304,
    "init_pos_emb_height": 64,
    "init_pos_emb_width": 64,
    "merge_kernel_size": [
      2,
      2
    ],
    "torch_dtype": "bfloat16"
  },
  "text_config": {
    "vocab_size": 163840,
    "max_position_embeddings": 131072,
    "hidden_size": 2048,
    "intermediate_size": 11264,
    "moe_intermediate_size": 1408,
    "num_hidden_layers": 27,
    "num_attention_heads": 16,
    "n_shared_experts": 2,
    "n_routed_experts": 64,
    "ep_size": 1,
    "routed_scaling_factor": 2.446,
    "kv_lora_rank": 512,
    "q_lora_rank": null,
    "qk_rope_head_dim": 64,
    "v_head_dim": 128,
    "qk_nope_head_dim": 128,
    "topk_method": "noaux_tc",
    "n_group": 1,
    "topk_group": 1,
    "num_experts_per_tok": 6,
    "moe_layer_freq": 1,
    "first_k_dense_replace": 1,
    "norm_topk_prob": true,
    "scoring_func": "sigmoid",
    "aux_loss_alpha": 0.001,
    "seq_aux": true,
    "num_key_value_heads": 16,
    "hidden_act": "silu",
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-05,
    "pretraining_tp": 1,
    "use_cache": true,
    "rope_theta": 800000.0,
    "rope_scaling": null,
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 163584,
    "pad_token_id": 163839,
    "eos_token_id": 163585,
    "torch_dtype": "bfloat16",
    "tie_word_embeddings": false
  },
  "ignore_index": -100,
  "media_placeholder_token_id": 163605,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.50.3",
  "tie_word_embeddings": false,
  "vocab_size": 163840,
  "model_type": "kimi_vl"
 }
--- a/configuration.json
+++ b/configuration.json
@ -0,0 +1 @@
 {"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}
--- a/configuration_kimi_vl.py
+++ b/configuration_kimi_vl.py
@ -0,0 +1,284 @@
 from transformers.configuration_utils import PretrainedConfig
 from transformers.utils import logging
 from typing import Optional, Union
 logger = logging.get_logger(__name__)
 DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
 class DeepseekV3Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`DeepseekV3Model`]. It is used to instantiate a DeepSeek
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the DeepSeek-V3.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Copied from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/configuration_deepseek.py
    Args:
        vocab_size (`int`, *optional*, defaults to 129280):
            Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`DeepseekV3Model`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        moe_intermediate_size (`int`, *optional*, defaults to 1407):
            Dimension of the MoE representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_nextn_predict_layers (`int`, *optional*, defaults to 1):
            Number of nextn predict layers in the DeepSeekV3 Model.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        n_shared_experts (`int`, *optional*, defaults to None):
            Number of shared experts, None means dense model.
        n_routed_experts (`int`, *optional*, defaults to None):
            Number of routed experts, None means dense model.
        routed_scaling_factor (`float`, *optional*, defaults to 1.0):
            Scaling factor for routed experts.
        topk_method (`str`, *optional*, defaults to `gready`):
            Topk method used in routed gate.
        n_group (`int`, *optional*, defaults to None):
            Number of groups for routed experts.
        topk_group (`int`, *optional*, defaults to None):
            Number of selected groups for each token (ensuring that the selected experts lie only within `topk_group` groups).
        num_experts_per_tok (`int`, *optional*, defaults to None):
            Number of selected experts, None means dense model.
        moe_layer_freq (`int`, *optional*, defaults to 1):
            The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
        first_k_dense_replace (`int`, *optional*, defaults to 0):
            Number of dense layers in shallow layers (embed->dense->dense->...->dense->moe->moe...->lm_head).
                                                            \--k dense layers--/
        norm_topk_prob (`bool`, *optional*, defaults to False):
            Whether to normalize the weights of the routed experts.
        scoring_func (`str`, *optional*, defaults to 'softmax'):
            Method of computing expert weights.
        aux_loss_alpha (`float`, *optional*, defaults to 0.001):
            Auxiliary loss weight coefficient.
        seq_aux (`bool`, *optional*, defaults to True):
            Whether to compute the auxiliary loss for each individual sample.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie the input and output word embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
    ```python
    >>> from transformers import DeepseekV3Model, DeepseekV3Config
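    >>> # Initializing a Deepseek-V3 style configuration
    >>> configuration = DeepseekV3Config()
    >>> # Initializing a model from the configuration (the defaults describe a very large model)
    >>> model = DeepseekV3Model(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config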
    ```"""
    model_type = "deepseek_v3"
    keys_to_ignore_at_inference = ["past_key_values"]
    def __init__(
        self,
        vocab_size=129280,
        hidden_size=7168,
        intermediate_size=18432,
        moe_intermediate_size=2048,
        num_hidden_layers=61,
        num_nextn_predict_layers=1,
        num_attention_heads=128,
        num_key_value_heads=128,
        n_shared_experts=1,
        n_routed_experts=256,
        ep_size=1,
        routed_scaling_factor=2.5,
        kv_lora_rank=512,
        q_lora_rank=1536,
        qk_rope_head_dim=64,
        v_head_dim=128,
        qk_nope_head_dim=128,
        topk_method="noaux_tc",
        n_group=8,
        topk_group=4,
        num_experts_per_tok=8,
        moe_layer_freq=1,
        first_k_dense_replace=3,
        norm_topk_prob=True,
        scoring_func="sigmoid",
        aux_loss_alpha=0.001,
        seq_aux=True,
        hidden_act="silu",
        max_position_embeddings=4096,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.moe_intermediate_size = moe_intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_nextn_predict_layers = num_nextn_predict_layers
        self.num_attention_heads = num_attention_heads
        self.n_shared_experts = n_shared_experts
        self.n_routed_experts = n_routed_experts
        self.ep_size = ep_size
        self.routed_scaling_factor = routed_scaling_factor
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        self.topk_method = topk_method
        self.n_group = n_group
        self.topk_group = topk_group
        self.num_experts_per_tok = num_experts_per_tok
        self.moe_layer_freq = moe_layer_freq
        self.first_k_dense_replace = first_k_dense_replace
        self.norm_topk_prob = norm_topk_prob
        self.scoring_func = scoring_func
        self.aux_loss_alpha = aux_loss_alpha
        self.seq_aux = seq_aux
        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
 class MoonViTConfig(PretrainedConfig):
    model_type = "moonvit"
    def __init__(
        self,
        patch_size: int = 14,
        init_pos_emb_height: int = 64,
        init_pos_emb_width: int = 64,
        num_attention_heads: int = 16,
        num_hidden_layers: int = 27,
        hidden_size: int = 1152,
        intermediate_size: int = 4304,
        merge_kernel_size: tuple[int, int] = (2, 2),
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        # Positional embedding config
        self.init_pos_emb_height = init_pos_emb_height
        self.init_pos_emb_width = init_pos_emb_width
        # Transformer config
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        # Patch merger config
        self.merge_kernel_size = merge_kernel_size
 class KimiVLConfig(PretrainedConfig):
    model_type = "kimi_vl"
    def __init__(
        self,
        vision_config: Optional[Union[dict, MoonViTConfig]] = None,
        text_config: Optional[Union[dict, DeepseekV3Config]] = None,
        ignore_index: int = -100,
        media_placeholder_token_id: int = 163605,
        pad_token_id: int = 0,
        **kwargs,
    ):
        if vision_config is None:
            vision_config = MoonViTConfig()
        elif isinstance(vision_config, dict):
            vision_config = MoonViTConfig(**vision_config)
        self.vision_config = vision_config
        if text_config is None:
            text_config = DeepseekV3Config()
        elif isinstance(text_config, dict):
            text_config = DeepseekV3Config(**text_config)
        self.text_config = text_config
        self.ignore_index = ignore_index
        self.media_placeholder_token_id = media_placeholder_token_id
        attn_implementation = kwargs.get("attn_implementation")
        if attn_implementation is not None:
            if attn_implementation in ["eager", "flash_attention_2"]:
                self._attn_implementation = attn_implementation
                self.vision_config._attn_implementation = attn_implementation
                self.text_config._attn_implementation = attn_implementation
            else:
                raise ValueError(
                    f"Invalid attention implementation: {attn_implementation}"
                )
        super().__init__(pad_token_id=pad_token_id, **kwargs)
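
The `num_key_value_heads` docstring in `DeepseekV3Config` describes how the head counts select MHA, MQA, or GQA, and how key/value heads are mean-pooled when converting a multi-head checkpoint. The following is a minimal sketch of both rules on plain Python lists. The helper names `attention_kind` and `mean_pool_kv_heads` are hypothetical, and grouping consecutive heads together is an assumption, not something this repository states.

```python
def attention_kind(num_attention_heads, num_key_value_heads):
    """Classify the attention variant from the two head counts."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def mean_pool_kv_heads(heads, num_key_value_heads):
    """Average groups of per-head vectors down to `num_key_value_heads` heads.

    `heads` is a list with one vector (list of floats) per original head.
    Assumption: consecutive heads belong to the same group.
    """
    num_heads = len(heads)
    if num_heads % num_key_value_heads != 0:
        raise ValueError("num_key_value_heads must divide the number of heads")
    group_size = num_heads // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean across the heads of this group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


if __name__ == "__main__":
    print(attention_kind(128, 128))  # MHA
    print(mean_pool_kv_heads([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], 2))
```

A real conversion would apply the same averaging to the key and value projection weights of each layer rather than to toy vectors.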
--- a/figures/arch.png
+++ b/figures/arch.png
--- a/figures/demo1.png
+++ b/figures/demo1.png
--- a/figures/demo2.png
+++ b/figures/demo2.png
--- a/figures/logo.png
+++ b/figures/logo.png
--- a/figures/screenshot.png
+++ b/figures/screenshot.png
--- a/figures/thinking_perf.png
+++ b/figures/thinking_perf.png
--- a/generation_config.json
+++ b/generation_config.json
@ -0,0 +1,9 @@
 {
    "bos_token_id": 163584,
    "pad_token_id": 163838,
    "eos_token_id": [
        163585
    ],
    "do_sample": true,
    "temperature": 0.6
 }
--- a/image_processing_kimi_vl.py
+++ b/image_processing_kimi_vl.py
@ -0,0 +1,126 @@
 """Image processor class for KimiVL."""
 import math
 import numpy as np
 from PIL import Image
 from typing import Optional, Union
 import torch
 from torchvision.transforms import functional as TF
 from transformers.image_utils import ImageInput, make_list_of_images, valid_images
 from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
 from transformers.utils import TensorType
 OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
 OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)
 class KimiVLImageProcessor(BaseImageProcessor):
    model_type = "kimi_vl"
    def __init__(
        self,
        patch_size: int = 14,
        pad_input: bool = False,
        image_mean: tuple[float, float, float] = OPENAI_DATASET_MEAN,
        image_std: tuple[float, float, float] = OPENAI_DATASET_STD,
        in_token_limit: int = 4096,
        merge_kernel_size: list[int, int] = [2, 2],
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.in_token_limit = in_token_limit
        self.patch_size = patch_size
        self.pad_input = pad_input
        self.image_mean = image_mean
        self.image_std = image_std
        self.merge_kernel_size = merge_kernel_size
    def rescale(
        self, image: Image.Image, merge_kernel_size: list[int, int] = [2, 2]
    ) -> Image.Image:
        w, h = image.size
        patch_size = self.patch_size
        if (w // patch_size) * (h // patch_size) > self.in_token_limit:
            scale = math.sqrt(self.in_token_limit / ((w // patch_size) * (h // patch_size)))
            new_w, new_h = int(w * scale), int(h * scale)
            image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
        if self.pad_input:
            new_w, new_h = image.size
            pad_size_h = merge_kernel_size[0] * patch_size
            pad_size_w = merge_kernel_size[1] * patch_size
            pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
            pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w
            image = TF.pad(image, (0, 0, pad_w, pad_h))
        else:
            new_w, new_h = image.size
            new_w = new_w - new_w % patch_size
            new_h = new_h - new_h % patch_size
            image = TF.center_crop(image, (new_h, new_w))
        w, h = image.size
        if w // patch_size >= 512 or h // patch_size >= 512:
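            raise ValueError(
                f"Image grid of {h // patch_size}x{w // patch_size} patches exceeds the positional embedding limit "
                "(each side must be below 512 patches)."
            )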
        return image
    def to_tensor(self, image: Image.Image) -> torch.Tensor:
        return TF.to_tensor(image.convert("RGB"))
    def normalize(self, image: torch.Tensor) -> torch.Tensor:
        return TF.normalize(image, self.image_mean, self.image_std)
    def patchify(self, image: torch.Tensor) -> tuple[torch.Tensor, tuple[int, int]]:
        patch_size = self.patch_size
        C, H, W = image.shape
        patches = image.reshape(C, H // patch_size, patch_size, W // patch_size, patch_size)
        patches = patches.permute(1, 3, 0, 2, 4)
        patches = patches.contiguous().view(-1, C, patch_size, patch_size)
        grid_hw = (H // patch_size, W // patch_size)
        return patches, grid_hw
    def _preprocess(self, image: ImageInput) -> tuple[torch.Tensor, tuple[int, int]]:
        """
        Preprocess image and patchify it.
        Args:
            image (`ImageInput`):
                Image to preprocess. `rescale` relies on the PIL API (`image.size`, `image.resize`), so a
                `PIL.Image.Image` is expected.
        Returns:
            patches: torch.Tensor of shape `(grid_h * grid_w, C, patch_size, patch_size)`
            grid_hw: tuple[int, int], the patch grid as `(grid_h, grid_w)`
        """
        image = self.rescale(image, self.merge_kernel_size)
        image = self.to_tensor(image)
        image = self.normalize(image)
        patches, grid_hw = self.patchify(image)
        return patches, grid_hw
    def preprocess(
        self,
        images: ImageInput,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ) -> BatchFeature:
        images = make_list_of_images(images)
        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        pixel_values, image_grid_hws = [], []
        for image in images:
            patches, image_grid_hw = self._preprocess(image)
            pixel_values.append(patches)
            image_grid_hws.append(image_grid_hw)
        pixel_values = torch.concat(pixel_values, dim=0)
        image_grid_hws = np.array(image_grid_hws)
        data = {"pixel_values": pixel_values, "image_grid_hws": image_grid_hws}
        return BatchFeature(data=data, tensor_type=return_tensors)
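
The resizing arithmetic in `rescale` can be traced without PIL or torch. The sketch below mirrors the steps above (downscale when the patch count exceeds `in_token_limit`, then either pad to a multiple of `merge_kernel_size * patch_size` or crop to a multiple of `patch_size`) and reports the resulting patch grid. The function name `planned_grid` is hypothetical, and the merged token count follows the `prod() // merge_length` rule used by the processor in `processing_kimi_vl.py`.

```python
import math


def planned_grid(width, height, patch_size=14, in_token_limit=4096,
                 merge_kernel_size=(2, 2), pad_input=False):
    """Return (grid_h, grid_w, merged_tokens) for an image of the given size."""
    w, h = width, height
    patches = (w // patch_size) * (h // patch_size)
    if patches > in_token_limit:
        # Shrink both sides by the same factor so the patch count fits the limit.
        scale = math.sqrt(in_token_limit / patches)
        w, h = int(w * scale), int(h * scale)
    if pad_input:
        # Pad up to a multiple of (merge kernel * patch size) on each side.
        unit_h = merge_kernel_size[0] * patch_size
        unit_w = merge_kernel_size[1] * patch_size
        h += (unit_h - h % unit_h) % unit_h
        w += (unit_w - w % unit_w) % unit_w
    else:
        # Crop down to a multiple of the patch size on each side.
        w -= w % patch_size
        h -= h % patch_size
    grid_h, grid_w = h // patch_size, w // patch_size
    merged = (grid_h * grid_w) // (merge_kernel_size[0] * merge_kernel_size[1])
    return grid_h, grid_w, merged


if __name__ == "__main__":
    print(planned_grid(1000, 800, pad_input=True))   # (58, 72, 1044)
    print(planned_grid(1000, 800, pad_input=False))  # (57, 71, 1011)
```

With `pad_input=True`, both grid sides come out as multiples of the merge kernel, so the 2x2 patch merger divides the grid evenly.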
--- a/model-00001-of-00007.safetensors
+++ b/model-00001-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:0fcc575e77d59bbf4504439266d770827a3018b56e2eecf4ab5a218c944d6326
 size 135
--- a/model-00002-of-00007.safetensors
+++ b/model-00002-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:2b830e3948efd3257f27de6b2960c64a6704dafed27ebf6f53b9e1af3a859fe2
 size 135
--- a/model-00003-of-00007.safetensors
+++ b/model-00003-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:729e85635b6f8c04a5a50e1330b6d190651f9a5901a34d3ee7caa0d20dbc6bbb
 size 135
--- a/model-00004-of-00007.safetensors
+++ b/model-00004-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:2339e10a3d58cb9908660eb5b76ae75b3c0c091a773575083d4920a905ea7a45
 size 135
--- a/model-00005-of-00007.safetensors
+++ b/model-00005-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:0dd4dd6c7179e98fb69402f7d4719c175ffcfa5f5cdea208fb7be9e2f6204230
 size 135
--- a/model-00006-of-00007.safetensors
+++ b/model-00006-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:db8f22f385d5743cabc235eff38ac77ab7c77023f906e8c56b074b56cda67bb7
 size 135
--- a/model-00007-of-00007.safetensors
+++ b/model-00007-of-00007.safetensors
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:3a4aac88787ee97d9a34627d0c03fc7883b98253a239fd93a5cc82a9c8f0d8a7
 size 135
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
--- a/modeling_kimi_vl.py
+++ b/modeling_kimi_vl.py
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@ -0,0 +1,20 @@
 {
    "auto_map": {
        "AutoImageProcessor": "image_processing_kimi_vl.KimiVLImageProcessor",
        "AutoProcessor": "processing_kimi_vl.KimiVLProcessor"
    },
    "in_token_limit": 16384,
    "patch_size": 14,
    "num_pooled_tokens": 1024,
    "image_mean": [
        0.5,
        0.5,
        0.5
    ],
    "image_std": [
        0.5,
        0.5,
        0.5
    ],
    "pad_input": true
 }
--- a/processing_kimi_vl.py
+++ b/processing_kimi_vl.py
@ -0,0 +1,170 @@
 # coding=utf-8
 # Copyright 2025 The Moonshot Team and HuggingFace Inc. team. All rights reserved.
 #
 # The code is based on the Qwen2VL processor (qwen2_vl/processing_qwen2_vl.py), but modified for KimiVL.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Processor class for KimiVL.
 """
 from typing import List, Union
 from transformers.feature_extraction_utils import BatchFeature
 from transformers.image_utils import ImageInput
 from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, _validate_images_text_input_order
 from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
 from transformers.utils import logging
 logger = logging.get_logger(__name__)
 class KimiVLProcessorKwargs(ProcessingKwargs, total=False):
    _defaults = {
        "text_kwargs": {
            "padding": False,
        },
        "images_kwargs": {},
    }
 class KimiVLProcessor(ProcessorMixin):
    r"""
    Constructs a KimiVL processor which wraps a KimiVL image processor and a tokenizer into a single processor.
    [`KimiVLProcessor`] offers all the functionalities of [`KimiVLImageProcessor`] and [`TikTokenTokenizer`]. See the
    [`~KimiVLProcessor.__call__`] and [`~KimiVLProcessor.decode`] for more information.
    Args:
        image_processor ([`KimiVLImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`TikTokenTokenizer`], *optional*):
            The tokenizer is a required input.
        chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
            in a chat into a tokenizable string.
    """
    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = ["chat_template"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"
    def __init__(
        self,
        image_processor=None,
        tokenizer=None,
        chat_template=None,
        **kwargs,
    ):
        self.image_token = "<|media_pad|>"
        super().__init__(image_processor, tokenizer, chat_template=chat_template)
    def __call__(
        self,
        images: ImageInput = None,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        **kwargs: Unpack[KimiVLProcessorKwargs],
    ) -> BatchFeature:
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to TikTokenTokenizer's [`~TikTokenTokenizer.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        KimiVLImageProcessor's [`~KimiVLImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.
        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
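            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **image_grid_hws** -- Patch grid `(height, width)` of each image. Returned when `images` is not `None`.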
        """
        if images is None and text is None:
            raise ValueError("You have to specify at least one of `images` or `text`.")
        # check if images and text inputs are reversed for BC
        images, text = _validate_images_text_input_order(images, text)
        output_kwargs = self._merge_kwargs(
            KimiVLProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        if images is not None:
            image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
            image_grid_hws = image_inputs["image_grid_hws"]
        else:
            image_inputs = {}
            image_grid_hws = None
        if isinstance(text, str):
            text = [text]
        elif not isinstance(text, list) or not isinstance(text[0], str):
            raise ValueError("Invalid input text. Please provide a string, or a list of strings")
        if image_grid_hws is not None:
            merge_length = self.image_processor.merge_kernel_size[0] * self.image_processor.merge_kernel_size[1]
            index = 0
            for i in range(len(text)):
                while self.image_token in text[i]:
                    text[i] = text[i].replace(
                        self.image_token,
                        "<|placeholder|>" * (image_grid_hws[index].prod() // merge_length),
                        1,
                    )
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", self.image_token)
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        return BatchFeature(data={**text_inputs, **image_inputs})
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to TikTokenTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to TikTokenTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)
    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
 __all__ = ["KimiVLProcessor", "KimiVLProcessorKwargs"]
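
The placeholder expansion in `KimiVLProcessor.__call__` can be reproduced on plain strings. The sketch below replaces each `<|media_pad|>` occurrence, in order, with one token per merged patch of the corresponding image. The function name `expand_image_tokens` is hypothetical, and grids are passed as plain `(height, width)` tuples rather than the NumPy array the image processor returns.

```python
def expand_image_tokens(text, grid_hws, image_token="<|media_pad|>",
                        merge_kernel_size=(2, 2)):
    """Expand each image token to (grid_h * grid_w) // merge_length copies."""
    merge_length = merge_kernel_size[0] * merge_kernel_size[1]
    placeholder = "<|placeholder|>"
    index = 0
    while image_token in text:
        grid_h, grid_w = grid_hws[index]
        # Use a temporary placeholder so already-expanded tokens are not matched again.
        text = text.replace(image_token, placeholder * ((grid_h * grid_w) // merge_length), 1)
        index += 1
    return text.replace(placeholder, image_token)


if __name__ == "__main__":
    prompt = "Compare <|media_pad|> with <|media_pad|>."
    expanded = expand_image_tokens(prompt, [(2, 2), (2, 4)])
    print(expanded.count("<|media_pad|>"))  # 3
```

The temporary `<|placeholder|>` string is what keeps the `while` loop from running forever: without it, every expansion would reintroduce the token being searched for.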
--- a/tiktoken.model
+++ b/tiktoken.model
@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:e065f594aa8ac9c7c583e31626a98bc910072b59f7c24fe9f5980653bdf936ae
 size 132
--- a/tokenization_moonshot.py
+++ b/tokenization_moonshot.py
@ -0,0 +1,302 @@
 import os
 import tiktoken
 from logging import getLogger
 from pathlib import Path
 from typing import (
    cast,
    Tuple,
    Dict,
    Iterator,
    List,
    Union,
    Optional,
 )
 from shutil import copyfile
 from tiktoken.load import load_tiktoken_bpe
 from tokenizers import AddedToken
 from transformers.tokenization_utils import PreTrainedTokenizer
 from transformers.utils import to_py_obj
 from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
 logger = getLogger(__name__)
 VOCAB_FILES_NAMES = {"vocab_file": "tiktoken.model"}
 SPIECE_UNDERLINE = "▁"
 class TikTokenTokenizer(PreTrainedTokenizer):
    """
    Tokenizing and encoding/decoding text using the Tiktoken tokenizer. See megatron/tokenizer/tiktoken_tokenizer.py.
    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.
    Args:
        vocab_file (`str`):
            The path to the Tiktoken model file.
        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[BOS]"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[EOS]"`):
            The end of sequence token.
        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[UNK]"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead. The second to last item in special_tokens.
        pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[PAD]"`):
            The token used for padding, for example when batching sequences of different lengths.
        additional_special_tokens (list of `str`, *optional*):
            A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
            skipped when decoding if `skip_special_tokens` is set to `True`.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]
    special_tokens: Dict[str, int]
    num_reserved_special_tokens = 256
    pat_str = "|".join(
        [
            r"""[\p{Han}]+""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""\p{N}{1,3}""",
            r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
            r"""\s*[\r\n]+""",
            r"""\s+(?!\S)""",
            r"""\s+""",
        ]
    )
    def __init__(
        self,
        vocab_file,
        bos_token: Union[str, AddedToken] = "[BOS]",
        eos_token: Union[str, AddedToken] = "[EOS]",
        unk_token: Union[str, AddedToken] = "[UNK]",
        pad_token: Union[str, AddedToken] = "[PAD]",
        additional_special_tokens: Optional[List[str]] = None,
        added_tokens_decoder: Optional[dict] = None,
        **kwargs,
    ):
        assert os.path.isfile(vocab_file), vocab_file
        if additional_special_tokens is None:
            additional_special_tokens = [
                "<|im_end|>",
                "<|im_middle|>",
                "<|im_user|>",
                "<|im_assistant|>",
                "<|im_system|>",
            ]
        special_tokens_mapping = {
            i: added_tokens_decoder[i].content for i in added_tokens_decoder
        }
        self.vocab_file = vocab_file
        mergeable_ranks = load_tiktoken_bpe(vocab_file)
        num_base_tokens = len(mergeable_ranks)
        self.special_tokens = {
            special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
            for i in range(
                num_base_tokens, num_base_tokens + self.num_reserved_special_tokens + 2
            )
        }
        self.model = tiktoken.Encoding(
            name=Path(vocab_file).name,
            pat_str=self.pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens,
        )
        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens[str(bos_token)]
        self.eos_id: int = self.special_tokens[str(eos_token)]
        self.pad_id: int = self.special_tokens[str(pad_token)]
        self.unk_id: int = self.special_tokens[str(unk_token)]
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        self.decoder = {}
        for i in range(self.n_words):
            # Taken from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
            decoding = "".join(
                [
                    self.byte_encoder[ord(char)]
                    for char in self.model.decode_single_token_bytes(i).decode(
                        "latin-1"
                    )
                ]
            )
            self.decoder[i] = decoding
        self.encoder = {}
        for i in range(self.n_words):
            if i in self.decoder:
                self.encoder[self.decoder[i]] = i
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.all_special_ids_set = set(self.all_special_ids)
    def encode(
        self, text: str, allow_special_tokens: bool = True, **kwargs
    ) -> List[int]:
        """
        Encodes a string into a list of token IDs.
        Args:
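            text (str): The input string to be encoded.
            allow_special_tokens (bool): If `True`, special-token strings in `text` are encoded as their
                special token IDs; if `False`, they are encoded as ordinary text.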
        Returns:
            list[int]: A list of token IDs.
        """
        # If there are other args, we should call super().encode because there is a lot of code
        # to handle those args. super().encode will eventually call _tokenize and _convert_token_to_id.
        if len(kwargs) > 0:
            return super().encode(text, **kwargs)
        assert type(text) is str
        # The tiktoken tokenizer can handle <=400k chars without
        # pyo3_runtime.PanicException.
        TIKTOKEN_MAX_ENCODE_CHARS = 400_000
        # https://github.com/openai/tiktoken/issues/195
        # Here we iterate over subsequences and split if we exceed the limit
        # of max consecutive non-whitespace or whitespace characters.
        MAX_NO_WHITESPACES_CHARS = 25_000
        substrs = (
            substr
            for i in range(0, len(text), TIKTOKEN_MAX_ENCODE_CHARS)
            for substr in self._split_whitespaces_or_nonwhitespaces(
                text[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
            )
        )
        t: List[int] = []
        for substr in substrs:
            if allow_special_tokens:
                t.extend(
                    # encode special-token strings as their special token IDs
                    self.model.encode(
                        substr,
                        allowed_special="all",
                    )
                )
            else:
                t.extend(
                    # treat special-token strings as ordinary text
                    self.model.encode(
                        substr,
                        disallowed_special=(),
                    )
                )
        return t
    def decode(self, token_ids: Union[int, List[int]], **kwargs) -> str:
        """
        Decodes a list of token IDs into a string.
        Args:
            token_ids (int or List[int]): The token ID or list of token IDs to be decoded.
        Returns:
            str: The decoded string.
        """
        # If there are other args, we should call super().decode because there is a lot of code
        # to handle those args. super().decode will eventually call convert_tokens_to_string and _convert_id_to_token.
        if len(kwargs) > 0:
            return super().decode(token_ids, **kwargs)
        token_ids = to_py_obj(token_ids)
        if type(token_ids) is int:
            token_ids = [token_ids]
        return self.model.decode(cast(List[int], token_ids))
    @staticmethod
    def _split_whitespaces_or_nonwhitespaces(
        s: str, max_consecutive_slice_len: int
    ) -> Iterator[str]:
        """
        Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
        consecutive whitespaces or consecutive non-whitespaces.
        """
        current_slice_len = 0
        current_slice_is_space = s[0].isspace() if len(s) > 0 else False
        slice_start = 0
        for i in range(len(s)):
            is_now_space = s[i].isspace()
            if current_slice_is_space ^ is_now_space:
                current_slice_len = 1
                current_slice_is_space = is_now_space
            else:
                current_slice_len += 1
                if current_slice_len > max_consecutive_slice_len:
                    yield s[slice_start:i]
                    slice_start = i
                    current_slice_len = 1
        yield s[slice_start:]
    """ ----- Below are the abstract methods required by PreTrainedTokenizer ----- """
    @property
    def vocab_size(self) -> int:
        return self.n_words
    def get_vocab(self) -> Dict[str, int]:
        return self.encoder
    def _tokenize(self, text: str, **kwargs) -> List[str]:
        return [self.decoder[t] for t in self.encode(text)]
    def _convert_token_to_id(self, token: str) -> int:
        return self.encoder.get(token, self.unk_id)
    def _convert_id_to_token(self, index: int) -> str:
        return self.decoder.get(index)
    @staticmethod
    def clean_up_tokenization(out_string: str) -> str:
        return out_string
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        text = "".join(tokens).replace(SPIECE_UNDERLINE, "")
        text = bytearray([self.byte_decoder[c] for c in text]).decode(
            "utf-8", "replace"
        )
        return text
    def save_vocabulary(
        self, save_directory: str, filename_prefix: Optional[str] = None
    ) -> Tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory,
            (filename_prefix + "-" if filename_prefix else "")
            + VOCAB_FILES_NAMES["vocab_file"],
        )
        if os.path.abspath(self.vocab_file) != os.path.abspath(
            out_vocab_file
        ) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        return (out_vocab_file,)
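The splitting rule in `_split_whitespaces_or_nonwhitespaces` is easiest to see on a small input. The following is a minimal sketch that copies the helper's loop into a standalone function; the name `split_runs` and the sample string are hypothetical and not part of this commit. Note that a cut is made only when a run of whitespace or non-whitespace exceeds the limit, not at every change between the two.

```python
from typing import Iterator, List


def split_runs(s: str, max_consecutive_slice_len: int) -> Iterator[str]:
    """Standalone copy of the helper's loop (the name `split_runs` is hypothetical)."""
    current_slice_len = 0
    current_slice_is_space = s[0].isspace() if len(s) > 0 else False
    slice_start = 0
    for i in range(len(s)):
        is_now_space = s[i].isspace()
        if current_slice_is_space ^ is_now_space:
            # The run type changed: restart the run counter, but do not cut here.
            current_slice_len = 1
            current_slice_is_space = is_now_space
        else:
            current_slice_len += 1
            if current_slice_len > max_consecutive_slice_len:
                # The current run is too long: cut just before this character.
                yield s[slice_start:i]
                slice_start = i
                current_slice_len = 1
    yield s[slice_start:]


# Five letters, two spaces, two letters, with a run limit of 3.
pieces: List[str] = list(split_runs("aaaaa  bb", 3))
print(pieces)  # ['aaa', 'aa  bb']
```

Concatenating the pieces always reproduces the input, so the split only bounds run length before the text reaches the underlying encoder.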
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@ -0,0 +1,134 @@
 {
  "added_tokens_decoder": {
    "163584": {
      "content": "[BOS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163585": {
      "content": "[EOS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163586": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163601": {
      "content": "<|im_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163587": {
      "content": "<|im_user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163588": {
      "content": "<|im_assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163594": {
      "content": "<|im_system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163602": {
      "content": "<|media_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163603": {
      "content": "<|media_content|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163604": {
      "content": "<|media_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163605": {
      "content": "<|media_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163838": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "163839": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_end|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
    "<|media_start|>",
    "<|media_content|>",
    "<|media_end|>",
    "<|media_pad|>"
  ],
  "bos_token": "[BOS]",
  "clean_up_tokenization_spaces": false,
  "eos_token": "[EOS]",
  "extra_special_tokens": {},
  "model_max_length": 1048576,
  "pad_token": "[PAD]",
  "unk_token": "[UNK]",
  "tokenizer_class": "TikTokenTokenizer",
  "chat_template": "{%- for message in messages -%}{%- if loop.first and messages[0]['role'] != 'system' -%}{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}{%- endif -%}{%- if message['role'] == 'system' -%}{{'<|im_system|>'}}{%- endif -%}{%- if message['role'] == 'user' -%}{{'<|im_user|>'}}{%- endif -%}{%- if message['role'] == 'assistant' -%}{{'<|im_assistant|>'}}{%- endif -%}{{- message['role'] -}}{{'<|im_middle|>'}}{%- if message['content'] is string -%}{{- message['content'] + '<|im_end|>' -}}{%- else -%}{%- for content in message['content'] -%}{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}{{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}{%- else -%}{{content['text']}}{%- endif -%}{%- endfor -%}{{'<|im_end|>'}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{'<|im_assistant|>assistant<|im_middle|>'}}{%- endif -%}",
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_moonshot.TikTokenTokenizer",
      null
    ]
  }
 }
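The `chat_template` above is a single-line Jinja expression, which makes its control flow hard to follow. As a reading aid, here is a hand translation into plain Python. It is a sketch under stated assumptions, not how `transformers` renders the template (that goes through Jinja): the function name `render_chat`, the constants, and the sample message are all hypothetical.

```python
from typing import Any, Dict, List

# Role markers and fixed strings copied from the chat_template above.
ROLE_TOKENS = {
    "system": "<|im_system|>",
    "user": "<|im_user|>",
    "assistant": "<|im_assistant|>",
}
DEFAULT_SYSTEM = "<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>"
IMAGE_PLACEHOLDER = "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"


def render_chat(messages: List[Dict[str, Any]], add_generation_prompt: bool = False) -> str:
    """Hypothetical pure-Python mirror of the Jinja chat template."""
    parts: List[str] = []
    for index, message in enumerate(messages):
        # A default system turn is prepended when the conversation has none.
        if index == 0 and message["role"] != "system":
            parts.append(DEFAULT_SYSTEM)
        parts.append(ROLE_TOKENS.get(message["role"], ""))
        parts.append(message["role"])
        parts.append("<|im_middle|>")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                # Every image item becomes the same fixed placeholder sequence.
                if item.get("type") == "image" or "image" in item or "image_url" in item:
                    parts.append(IMAGE_PLACEHOLDER)
                else:
                    parts.append(item["text"])
        parts.append("<|im_end|>")
    if add_generation_prompt:
        parts.append("<|im_assistant|>assistant<|im_middle|>")
    return "".join(parts)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo.png"},  # hypothetical file name
            {"type": "text", "text": "What is shown?"},
        ],
    }
]
prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

In this sketch the image path never appears in the prompt text; only the placeholder sequence does.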
@ -0,0 +1 @@
 {"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}