Upload to PaddlePaddle/PaddleOCR-VL on ModelScope hub

TingquanGao
2025-10-21 10:12:45 +00:00
parent 78c582ff62
commit f172aef5ac
18 changed files with 13342 additions and 34 deletions

.gitattributes (vendored, 6 lines changed)

@@ -53,4 +53,8 @@ imgs/overview4.jpg filter=lfs diff=lfs merge=lfs -text
imgs/table_01.jpg filter=lfs diff=lfs merge=lfs -text
PP-DocLayoutV2/inference.pdiparams filter=lfs diff=lfs merge=lfs -text
imgs/text_english_arabic.jpg filter=lfs diff=lfs merge=lfs -text
PP-DocLayoutV2/inference.pdmodel filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.model filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text

README.md (modified)

@@ -3,6 +3,7 @@ license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- ERNIE4.5
+- PaddleOCR
- PaddlePaddle
- image-to-text
- ocr
@@ -16,6 +17,7 @@ language:
- en
- zh
- multilingual
+library_name: PaddleOCR
---
<div align="center">
@@ -42,7 +44,7 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
</div>
<div align="center">
-<img src="./imgs/allmetric.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/allmetric.png" width="800"/>
</div>
## Introduction
@@ -65,7 +67,7 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
<!-- PaddleOCR-VL decomposes the complex task of document parsing into two stages. The first stage, PP-DocLayoutV2, is responsible for layout analysis, where it localizes semantic regions and predicts their reading order. Subsequently, the second stage, PaddleOCR-VL-0.9B, leverages these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs from both stages and formats the final document into structured Markdown and JSON. -->
<div align="center">
-<img src="./imgs/paddleocrvl.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/paddleocrvl.png" width="800"/>
</div>
@@ -100,7 +102,6 @@ Python API usage:
```python
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL()
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
@@ -120,7 +121,6 @@ for res in output:
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server
```
2. Call the PaddleOCR CLI or Python API:
```bash
@@ -129,10 +129,8 @@ for res in output:
--vl_rec_backend vllm-server \
--vl_rec_server_url http://127.0.0.1:8080/v1
```
```python
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
@@ -140,7 +138,8 @@ for res in output:
res.save_to_json(save_path="output")
res.save_to_markdown(save_path="output")
```
+**For more usage details and parameter explanations, see the [documentation](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html).**
## Performance
### Page-Level Document Parsing
@@ -151,19 +150,18 @@
##### PaddleOCR-VL achieves SOTA performance for overall, text, formula, tables and reading order on OmniDocBench v1.5
<div align="center">
-<img src="./imgs/omni15.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni15.png" width="800"/>
</div>
#### 2. OmniDocBench v1.0
##### PaddleOCR-VL achieves SOTA performance for almost all metrics of overall, text, formula, tables and reading order on OmniDocBench v1.0
<div align="center">
-<img src="./imgs/omni10.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni10.png" width="800"/>
</div>
@@ -180,7 +178,7 @@
PaddleOCR-VL's robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block performance evaluation.
<div align="center">
-<img src="./imgs/omnibenchocr.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omnibenchocr.png" width="800"/>
</div>
@@ -189,7 +187,7 @@ PaddleOCR-VL's robust and versatile capability in handling diverse document ty
In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy with the lowest edit distances in all evaluated scripts.
<div align="center">
-<img src="./imgs/inhouseocr.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouseocr.png" width="800"/>
</div>
@@ -201,7 +199,7 @@ In-house-OCR provides an evaluation of performance across multiple languages and
Our self-built evaluation set contains diverse types of table images, such as Chinese, English, mixed Chinese-English, and tables with various characteristics like full, partial, or no borders, book/manual formats, lists, academic papers, merged cells, as well as low-quality, watermarked, etc. PaddleOCR-VL achieves remarkable performance across all categories.
<div align="center">
-<img src="./imgs/inhousetable.png" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousetable.png" width="600"/>
</div>
#### 3. Formula
@@ -211,7 +209,7 @@ Our self-built evaluation set contains diverse types of table images, such as Ch
In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.
<div align="center">
-<img src="./imgs/inhouse-formula.png" width="500"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouse-formula.png" width="500"/>
</div>
@@ -222,7 +220,7 @@ In-house-Formula evaluation set contains simple prints, complex prints, camera s
The evaluation set is broadly categorized into 11 chart categories, including bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.
<div align="center">
-<img src="./imgs/inhousechart.png" width="400"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousechart.png" width="400"/>
</div>
@@ -237,42 +235,42 @@ The evaluation set is broadly categorized into 11 chart categories, including ba
### Comprehensive Document Parsing
<div align="center">
-<img src="./imgs/overview1.jpg" width="600"/>
-<img src="./imgs/overview2.jpg" width="600"/>
-<img src="./imgs/overview3.jpg" width="600"/>
-<img src="./imgs/overview4.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview1.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview2.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview3.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview4.jpg" width="600"/>
</div>
### Text
<div align="center">
-<img src="./imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Table
<div align="center">
-<img src="./imgs/table_01.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/table_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Formula
<div align="center">
-<img src="./imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
</div>
### Chart
<div align="center">
-<img src="./imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
</div>
@@ -285,11 +283,13 @@ We would like to thank [ERNIE](https://github.com/PaddlePaddle/ERNIE), [Keye](ht
If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
```bibtex
-@misc{paddleocrvl2025technicalreport,
+@misc{cui2025paddleocrvlboostingmultilingualdocument,
title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
-author={Cui, C. et al.},
+author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
year={2025},
-primaryClass={cs.CL},
-howpublished={\url{https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf}}
+eprint={2510.14528},
+archivePrefix={arXiv},
+primaryClass={cs.CV},
+url={https://arxiv.org/abs/2510.14528},
}
```

added_tokens.json (new file, 1021 lines)

File diff suppressed because it is too large.

chat_template.jinja (new file, 22 lines)

@@ -0,0 +1,22 @@
{%- if not add_generation_prompt is defined -%}
{%- set add_generation_prompt = true -%}
{%- endif -%}
{%- if not cls_token is defined -%}
{%- set cls_token = "<|begin_of_sentence|>" -%}
{%- endif -%}
{%- if not sep_token is defined -%}
{%- set sep_token = "<|end_of_sentence|>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
{{- "User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" + message["content"] + "\n" -}}
{%- elif message["role"] == "assistant" -%}
{{- "Assistant: " + message["content"] + sep_token -}}
{%- elif message["role"] == "system" -%}
{{- message["content"] -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- "Assistant: " -}}
{%- endif -%}
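
As a sanity check, the template can be rendered standalone; a minimal sketch with jinja2, assuming a local copy of the file (the user text "OCR:" is an arbitrary example):

```python
# Minimal sketch: render chat_template.jinja directly with jinja2.
from jinja2 import Template

template = Template(open("chat_template.jinja").read())
prompt = template.render(
    messages=[{"role": "user", "content": "OCR:"}],
    add_generation_prompt=True,
)
print(repr(prompt))
# '<|begin_of_sentence|>User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:\nAssistant: '
```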

config.json (new file, 75 lines)

@@ -0,0 +1,75 @@
{
"architectures": [
"PaddleOCRVLForConditionalGeneration"
],
"attention_probs_dropout_prob": 0.0,
"auto_map": {
"AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
"AutoModel": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration",
"AutoModelForCausalLM": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration"
},
"compression_ratio": 1.0,
"head_dim": 128,
"hidden_act": "silu",
"hidden_dropout_prob": 0.0,
"hidden_size": 1024,
"ignored_index": -100,
"image_token_id": 100295,
"intermediate_size": 3072,
"max_position_embeddings": 131072,
"max_sequence_length": null,
"model_type": "paddleocr_vl",
"num_attention_heads": 16,
"num_hidden_layers": 18,
"num_key_value_heads": 2,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 500000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.55.0",
"use_bias": false,
"use_cache": false,
"use_flash_attention": false,
"video_token_id": 101307,
"vision_config": {
"architectures": [
"SiglipVisionModel"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
"AutoModel": "modeling_paddleocr_vl.SiglipVisionModel"
},
"hidden_act": "gelu_pytorch_tanh",
"hidden_size": 1152,
"image_size": 384,
"intermediate_size": 4304,
"layer_norm_eps": 1e-06,
"model_type": "paddleocr_vl",
"num_attention_heads": 16,
"num_channels": 3,
"num_hidden_layers": 27,
"pad_token_id": 0,
"patch_size": 14,
"spatial_merge_size": 2,
"temporal_patch_size": 2,
"tokens_per_second": 2,
"torch_dtype": "bfloat16"
},
"vision_start_token_id": 101305,
"vocab_size": 103424,
"weight_share_add_bias": true,
"use_3d_rope": true,
"rope_is_neox_style": true
}
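
The `auto_map` above routes the Auto classes to the remote code in this repo. A minimal sketch of loading the config and reading a few of these fields (the model id is an assumption; `trust_remote_code` executes configuration_paddleocr_vl.py):

```python
# Minimal sketch: load the remote-code config and inspect key fields.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(cfg.model_type)                          # paddleocr_vl
print(cfg.hidden_size, cfg.num_hidden_layers)  # 1024 18 (0.9B-scale decoder)
print(cfg.vision_config.hidden_size)           # 1152 (SigLIP-style encoder)
print(cfg.rope_scaling["mrope_section"])       # [16, 24, 24]
```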

configuration_paddleocr_vl.py (new file, 191 lines)

@@ -0,0 +1,191 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation
class PaddleOCRVisionConfig(PretrainedConfig):
model_type = "paddleocr_vl"
base_config_key = "vision_config"
def __init__(
self,
hidden_size=768,
intermediate_size=3072,
num_hidden_layers=12,
num_attention_heads=12,
num_channels=3,
image_size=224,
patch_size=14,
hidden_act="gelu_pytorch_tanh",
layer_norm_eps=1e-6,
attention_dropout=0.0,
spatial_merge_size=2,
temporal_patch_size=2,
tokens_per_second=2,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.spatial_merge_size = spatial_merge_size
self.temporal_patch_size = temporal_patch_size
self.tokens_per_second = tokens_per_second
class PaddleOCRVLConfig(PretrainedConfig):
"""
Configuration class.
This class stores the configuration of an Ernie model, defining the model architecture.
It inherits from PretrainedConfig and can be used to control model outputs.
"""
model_type = "paddleocr_vl"
keys_to_ignore_at_inference = ["past_key_values"]
sub_configs = {"vision_config": PaddleOCRVisionConfig}
# Default tensor parallel plan, following the Qwen3-style attention/MLP projection layout
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
}
base_model_pp_plan = {
"embed_tokens": (["input_ids"], ["inputs_embeds"]),
"layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
"norm": (["hidden_states"], ["hidden_states"]),
}
def __init__(
self,
vocab_size=32000,
hidden_size=768,
intermediate_size=11008,
max_position_embeddings=32768,
num_hidden_layers=2,
num_attention_heads=2,
image_token_id=101304,
video_token_id=101305,
vision_start_token_id=101306,
rms_norm_eps=1e-6,
use_cache=False,
use_flash_attention=False,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
head_dim=128,
hidden_act="silu",
use_bias=False,
rope_theta=10000,
weight_share_add_bias=True,
ignored_index=-100,
attention_probs_dropout_prob=0.0,
hidden_dropout_prob=0.0,
compression_ratio: float = 1.0,
num_key_value_heads=None,
max_sequence_length=None,
tie_word_embeddings=False,
vision_config=None,
rope_scaling=None,
**kwargs,
):
"""
Initialize configuration with default or specified parameters.
Args:
vocab_size (int): Size of the vocabulary (number of unique tokens)
hidden_size (int): Dimensionality of the encoder layers and the pooler layer
intermediate_size (int): Dimensionality of the "intermediate" (feed-forward) layer
max_position_embeddings (int): Maximum sequence length the model can handle
num_hidden_layers (int): Number of hidden layers in the Transformer encoder
num_attention_heads (int): Number of attention heads for each attention layer
rms_norm_eps (float): The epsilon used by the RMS normalization layers
use_cache (bool): Whether to use caching for faster generation (decoding)
use_flash_attention (bool): Whether to use FlashAttention for optimized attention computation
pad_token_id (int): Token ID used for padding sequences
bos_token_id (int): Token ID used for beginning-of-sequence
eos_token_id (int): Token ID used for end-of-sequence
use_bias (bool): Whether to use bias terms in linear layers
rope_theta (float): The base period of the RoPE embeddings
weight_share_add_bias (bool): Whether to share bias weights in certain layers
ignored_index (int): Target value that is ignored during loss computation
attention_probs_dropout_prob (float): Dropout probability for attention weights
hidden_dropout_prob (float): Dropout probability for hidden layers
compression_ratio (float): Ratio for KV cache compression (1.0 = no compression)
num_key_value_heads (int): Number of key/value heads (for Grouped Query Attention)
max_sequence_length (int): Maximum sequence length for positional embeddings
**kwargs: Additional keyword arguments passed to parent class
"""
# Set default for tied embeddings if not specified.
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
if isinstance(vision_config, dict):
self.vision_config = self.sub_configs["vision_config"](**vision_config)
elif vision_config is None:
self.vision_config = self.sub_configs["vision_config"]()
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.max_position_embeddings = max_position_embeddings
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.use_flash_attention = use_flash_attention
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
self.image_token_id = image_token_id
self.video_token_id = video_token_id
self.vision_start_token_id = vision_start_token_id
self.head_dim = head_dim
self.hidden_act = hidden_act
self.sliding_window = None
self.use_bias = use_bias
self.weight_share_add_bias = weight_share_add_bias
self.rope_theta = rope_theta
self.ignored_index = ignored_index
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.hidden_dropout_prob = hidden_dropout_prob
self.compression_ratio = compression_ratio
self.num_key_value_heads = num_key_value_heads
self.max_sequence_length = max_sequence_length
self.rope_scaling = rope_scaling
if self.rope_scaling is not None and "type" in self.rope_scaling:
if self.rope_scaling["type"] == "mrope":
self.rope_scaling["type"] = "default"
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
rope_config_validation(self, ignore_keys={"mrope_section"})
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
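
The `rope_scaling` handling at the end of `__init__` deserves a worked example: a config created with type `"mrope"` is normalized to `"default"` while `mrope_section` is preserved for multimodal position splitting. A minimal sketch, assuming this file is importable from the checkout as `configuration_paddleocr_vl`:

```python
# Minimal sketch: "mrope" is rewritten to "default"; mrope_section survives.
from configuration_paddleocr_vl import PaddleOCRVLConfig

cfg = PaddleOCRVLConfig(rope_scaling={"type": "mrope", "mrope_section": [16, 24, 24]})
print(cfg.rope_scaling)
# {'type': 'default', 'mrope_section': [16, 24, 24], 'rope_type': 'default'}
```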

generation_config.json (new file, 6 lines)

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"eos_token_id": 2,
"transformers_version": "4.55.0",
"use_cache": false
}
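
These defaults travel with the checkpoint; a minimal sketch of reading them back (the model id is an assumption):

```python
# Minimal sketch: inspect the generation defaults shipped with the model.
from transformers import GenerationConfig

gen = GenerationConfig.from_pretrained("PaddlePaddle/PaddleOCR-VL")
print(gen.eos_token_id, gen.use_cache)  # 2 False
```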

image_processing.py (new file, 569 lines)

@@ -0,0 +1,569 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for PaddleOCR-VL."""
import math
from typing import Dict, List, Optional, Union
import numpy as np
import torch
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from torchvision.transforms import functional as TF
from transformers.image_transforms import (
convert_to_rgb,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
PILImageResampling,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
is_valid_image,
make_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from transformers.utils import TensorType, is_vision_available, logging
logger = logging.get_logger(__name__)
if is_vision_available():
from PIL import Image
ImageInput = Union[
"PIL.Image.Image",
np.ndarray,
"torch.Tensor",
List["PIL.Image.Image"],
List[np.ndarray],
List["torch.Tensor"],
] # noqa
VideoInput = Union[
List["PIL.Image.Image"],
"np.ndarray",
"torch.Tensor",
List["np.ndarray"],
List["torch.Tensor"],
List[List["PIL.Image.Image"]],
List[List["np.ndarray"]],
List[List["torch.Tensor"]],
] # noqa
def make_batched_images(images) -> List[List[ImageInput]]:
"""
Accepts images in list or nested list format, and makes a list of images for preprocessing.
Args:
images (`Union[List[List[ImageInput]], List[ImageInput], ImageInput]`):
The input image.
Returns:
list: A list of images.
"""
if (
isinstance(images, (list, tuple))
and isinstance(images[0], (list, tuple))
and is_valid_image(images[0][0])
):
return [img for img_list in images for img in img_list]
elif isinstance(images, (list, tuple)) and is_valid_image(images[0]):
return images
elif is_valid_image(images):
return [images]
raise ValueError(f"Could not make batched images from {images}")
def adjust_size(size, patch_size):
num_patches = size // patch_size
if num_patches % 2 != 0:  # if the patch count is odd, drop one to keep it even
num_patches -= 1
return num_patches * patch_size
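# Worked example (illustrative): adjust_size(300, 14) -> 300 // 14 = 21 patches,
# 21 is odd so it drops to 20, giving 20 * 14 = 280 pixels.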
def make_batched_videos(videos) -> List[VideoInput]:
if (
isinstance(videos, (list, tuple))
and isinstance(videos[0], (list, tuple))
and is_valid_image(videos[0][0])
):
return videos
elif isinstance(videos, (list, tuple)) and is_valid_image(videos[0]):
if isinstance(videos[0], Image.Image):
return [videos]
elif len(videos[0].shape) == 4:
return [list(video) for video in videos]
elif is_valid_image(videos) and len(videos.shape) == 4:
return [list(videos)]
raise ValueError(f"Could not make batched video from {videos}")
def smart_resize(
height: int,
width: int,
factor: int = 28,
min_pixels: int = 28 * 28 * 130,
max_pixels: int = 28 * 28 * 1280,
):
"""Rescales the image so that the following conditions are met:
1. Both dimensions (height and width) are divisible by 'factor'.
2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
3. The aspect ratio of the image is maintained as closely as possible.
"""
# if height < factor or width < factor:
# raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
# if int(height < factor//4) + int(width < factor//4):
# raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor//4}")
if height < factor:
print(f"smart_resize: height={height} < factor={factor}, reset height=factor")
width = round((width * factor) / height)
height = factor
if width < factor:
print(f"smart_resize: width={width} < factor={factor}, reset width=factor")
height = round((height * factor) / width)
width = factor
if max(height, width) / min(height, width) > 200:
raise ValueError(
f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
)
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
if h_bar * w_bar > max_pixels:
beta = math.sqrt((height * width) / max_pixels)
h_bar = math.floor(height / beta / factor) * factor
w_bar = math.floor(width / beta / factor) * factor
elif h_bar * w_bar < min_pixels:
beta = math.sqrt(min_pixels / (height * width))
h_bar = math.ceil(height * beta / factor) * factor
w_bar = math.ceil(width * beta / factor) * factor
return h_bar, w_bar
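# Worked example (illustrative): smart_resize(1000, 1400, factor=28) first rounds
# to 1008 x 1400 = 1,411,200 px, which exceeds max_pixels = 28*28*1280 = 1,003,520,
# so beta = sqrt(1,400,000 / 1,003,520) ~ 1.181 and the result floors to 840 x 1176.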
class SiglipImageProcessor(BaseImageProcessor):
r"""
Constructs a Siglip image processor that dynamically resizes images based on the original images.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use when resizing the image.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
min_pixels (`int`, *optional*, defaults to `28 * 28 * 130`):
The min pixels of the image to resize the image.
max_pixels (`int`, *optional*, defaults to `28 * 28 * 1670`):
The max pixels of the image to resize the image.
patch_size (`int`, *optional*, defaults to 14):
The spatial patch size of the vision encoder.
temporal_patch_size (`int`, *optional*, defaults to 2):
The temporal patch size of the vision encoder.
merge_size (`int`, *optional*, defaults to 2):
The merge size of the vision encoder to llm encoder.
"""
model_input_names = [
"pixel_values",
"image_grid_thw",
"pixel_values_videos",
"video_grid_thw",
]
def __init__(
self,
do_resize: bool = True,
resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = True,
min_pixels: int = 28 * 28 * 130,
max_pixels: int = 28 * 28 * 1280,
patch_size: int = 14,
temporal_patch_size: int = 1,
merge_size: int = 2,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.do_resize = do_resize
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.min_pixels = min_pixels
self.max_pixels = max_pixels
self.patch_size = patch_size
self.temporal_patch_size = temporal_patch_size
self.merge_size = merge_size
self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels} # not used
self.do_convert_rgb = do_convert_rgb
def mvit_rescale(self, image: Image.Image, merge_size: int = 2) -> Image.Image:
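# NOTE: this helper reads self.in_token_limit and self.pad_input, which are not
# set in __init__; callers are expected to assign them before invoking it.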
try:
w, h = image.size
except Exception:
raise ValueError(str((type(image), image)))
patch_size = self.patch_size
if (w // patch_size) * (h // patch_size) > self.in_token_limit:
scale = math.sqrt(
self.in_token_limit / ((w // patch_size) * (h // patch_size))
)
new_w, new_h = int(w * scale), int(h * scale)
image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
if self.pad_input:
new_w, new_h = image.size
pad_size_h = merge_size * patch_size
pad_size_w = merge_size * patch_size
pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w
image = TF.pad(image, (0, 0, pad_w, pad_h))
else:
new_w, new_h = image.size
new_w = new_w - new_w % patch_size
new_h = new_h - new_h % patch_size
new_w = adjust_size(new_w, patch_size)
new_h = adjust_size(new_h, patch_size)
image = TF.center_crop(image, (new_h, new_w))
w, h = image.size
if w // patch_size >= 512 or h // patch_size >= 512:
new_h = min(patch_size * 510, h)
new_w = min(patch_size * 510, w)
image = TF.center_crop(image, (new_h, new_w))
# raise ValueError("Exceed pos emb")
return image
def _preprocess(
self,
images: Union[ImageInput, VideoInput],
do_resize: bool = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
Args:
images (`ImageInput`):
Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
vision_info (`List[Dict]`, *optional*):
Optional list of dictionaries containing additional information about vision inputs.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format. - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
images = make_list_of_images(images)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
height, width = get_image_size(images[0], channel_dim=input_data_format)
resized_height, resized_width = height, width
processed_images = []
for image in images:
if do_resize:
resized_height, resized_width = smart_resize(
height,
width,
factor=self.patch_size * self.merge_size,
min_pixels=self.min_pixels,
max_pixels=self.max_pixels,
)
image = resize(
image,
size=(resized_height, resized_width),
resample=resample,
input_data_format=input_data_format,
)
if do_rescale:
image = self.rescale(
image, scale=rescale_factor, input_data_format=input_data_format
)
if do_normalize:
image = self.normalize(
image=image,
mean=image_mean,
std=image_std,
input_data_format=input_data_format,
)
image = to_channel_dimension_format(
image, data_format, input_channel_dim=input_data_format
)
processed_images.append(image)
patches = np.array(processed_images)
if data_format == ChannelDimension.LAST:
patches = patches.transpose(0, 3, 1, 2)
if patches.shape[0] == 1:
patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))
init_patches = patches
channel = patches.shape[1]
grid_t = patches.shape[0] // self.temporal_patch_size
grid_h, grid_w = (
resized_height // self.patch_size,
resized_width // self.patch_size,
)
patches = patches.reshape(
grid_t,
self.temporal_patch_size,
channel,
grid_h,
self.patch_size,
grid_w,
self.patch_size,
)
patches = patches.transpose(0, 3, 5, 2, 1, 4, 6)
assert self.temporal_patch_size == 1
flatten_patches = patches.reshape(
grid_t * grid_h * grid_w, channel, self.patch_size, self.patch_size
)
return flatten_patches, (grid_t, grid_h, grid_w)
def preprocess(
self,
images: ImageInput,
videos: VideoInput = None,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
videos (`VideoInput`):
Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
the longest edge resized to keep the input aspect ratio.
resample (`int`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = (
rescale_factor if rescale_factor is not None else self.rescale_factor
)
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = (
do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
)
if images is not None:
images = make_batched_images(images)
if videos is not None:
videos = make_batched_videos(videos)
if images is not None and not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
if images is not None:
pixel_values, vision_grid_thws = [], []
for image in images:
patches, image_grid_thw = self._preprocess(
image,
do_resize=do_resize,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
do_convert_rgb=do_convert_rgb,
input_data_format=input_data_format,
)
pixel_values.extend(patches)
vision_grid_thws.append(image_grid_thw)
pixel_values = np.array(pixel_values)
vision_grid_thws = np.array(vision_grid_thws)
data = {"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws}
if videos is not None:
pixel_values, vision_grid_thws = [], []
for images in videos:
patches, video_grid_thw = self._preprocess(
images,
do_resize=do_resize,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
do_convert_rgb=do_convert_rgb,
input_data_format=input_data_format,
)
pixel_values.extend(patches)
vision_grid_thws.append(video_grid_thw)
pixel_values = np.array(pixel_values)
vision_grid_thws = np.array(vision_grid_thws)
data = {
"pixel_values_videos": pixel_values,
"video_grid_thw": vision_grid_thws,
}
return BatchFeature(data=data, tensor_type=return_tensors)
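
As a usage sketch (synthetic input; the class and its defaults are as defined above), preprocessing one image yields flattened 14x14 patches plus the (t, h, w) grid that the language model later uses to expand image placeholder tokens:

```python
# Minimal sketch: run the processor on a synthetic 560x840 RGB image.
import numpy as np
from PIL import Image

processor = SiglipImageProcessor(image_mean=[0.5, 0.5, 0.5], image_std=[0.5, 0.5, 0.5])
image = Image.fromarray(np.random.randint(0, 256, (560, 840, 3), dtype=np.uint8))
batch = processor.preprocess(images=image, return_tensors="pt")
print(batch["pixel_values"].shape)  # torch.Size([2400, 3, 14, 14]) for this input
print(batch["image_grid_thw"])      # tensor([[ 1, 40, 60]])
```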

inference.yml (new file, 2 lines)

@@ -0,0 +1,2 @@
Global:
model_name: PaddleOCR-VL-0.9B

model.safetensors (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3085f1042e184f68f8a412aa0f64f2c4b8562989598bbfba326aaa11fc685de8
size 1917255968

modeling_paddleocr_vl.py (new file, 2674 lines)

File diff suppressed because it is too large.

preprocessor_config.json (new file, 33 lines)

@@ -0,0 +1,33 @@
{
"auto_map": {
"AutoImageProcessor": "image_processing.SiglipImageProcessor",
"AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
},
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "SiglipImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"max_pixels": 2822400,
"merge_size": 2,
"min_pixels": 147384,
"patch_size": 14,
"processor_class": "PaddleOCRVLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"max_pixels": 2822400,
"min_pixels": 147384
},
"temporal_patch_size": 1
}
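
The `auto_map` block wires transformers' Auto classes to the custom files above; a minimal sketch (the model id is an assumption):

```python
# Minimal sketch: auto_map resolves AutoImageProcessor to image_processing.py.
from transformers import AutoImageProcessor

ip = AutoImageProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(type(ip).__name__)             # SiglipImageProcessor
print(ip.min_pixels, ip.max_pixels)  # 147384 2822400
```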

processing_paddleocr_vl.py (new file, 293 lines)

@@ -0,0 +1,293 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Union
import numpy as np
import torch
from transformers.feature_extraction_utils import BatchFeature
from transformers.processing_utils import (
ProcessingKwargs,
ProcessorMixin,
Unpack,
VideosKwargs,
)
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
ImageInput = Union[
"PIL.Image.Image",
np.ndarray,
"torch.Tensor",
List["PIL.Image.Image"],
List[np.ndarray],
List["torch.Tensor"],
] # noqa
VideoInput = Union[
List["PIL.Image.Image"],
"np.ndarray",
"torch.Tensor",
List["np.ndarray"],
List["torch.Tensor"],
List[List["PIL.Image.Image"]],
List[List["np.ndarray"]],
List[List["torch.Tensor"]],
] # noqa
class PaddleOCRVLVideosProcessorKwargs(VideosKwargs, total=False):
fps: Union[List[float], float]
class PaddleOCRVLProcessorKwargs(ProcessingKwargs, total=False):
videos_kwargs: PaddleOCRVLVideosProcessorKwargs
_defaults = {
"text_kwargs": {
"padding": False,
},
"videos_kwargs": {"fps": 2.0},
}
class PaddleOCRVLProcessor(ProcessorMixin):
r"""
[`PaddleOCRVLProcessor`] offers all the functionalities of [`SiglipImageProcessor`] and [`Qwen2TokenizerFast`]. See the
[`~PaddleOCRVLProcessor.__call__`] and [`~PaddleOCRVLProcessor.decode`] for more information.
Args:
image_processor ([`SiglipImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`Qwen2TokenizerFast`], *optional*):
The tokenizer is a required input.
chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = [
"chat_template",
"image_std",
"min_pixels",
"image_mean",
"merge_size",
"image_processor_type",
"temporal_patch_size",
"patch_size",
"max_pixels",
]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self, image_processor=None, tokenizer=None, chat_template=None, **kwargs
):
self.image_token = (
"<|IMAGE_PLACEHOLDER|>"
if not hasattr(tokenizer, "image_token")
else tokenizer.image_token
)
self.video_token = (
"<|video_pad|>"
if not hasattr(tokenizer, "video_token")
else tokenizer.video_token
)
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
images: ImageInput = None,
text: Union[
TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]
] = None,
videos: VideoInput = None,
**kwargs: Unpack[PaddleOCRVLProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwargs` arguments to
SiglipImageProcessor's [`~SiglipImageProcessor.__call__`] if `vision_infos` is not `None`.
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
- **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
- **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
- **second_per_grid_ts** -- List of video seconds per time grid. Returned when `videos` is not `None`.
"""
output_kwargs = self._merge_kwargs(
PaddleOCRVLProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images=images, return_tensors="pt")
image_grid_thw = image_inputs["image_grid_thw"]
else:
image_inputs = {}
image_grid_thw = None
if videos is not None:
# TODO: add video processing
videos_inputs = self.image_processor(
images=None, videos=videos, **output_kwargs["images_kwargs"]
)
video_grid_thw = videos_inputs["video_grid_thw"]
fps = output_kwargs["videos_kwargs"].pop("fps", 2.0)
if isinstance(fps, (int, float)):
second_per_grid_ts = [
self.image_processor.temporal_patch_size / fps
] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
second_per_grid_ts = [
self.image_processor.temporal_patch_size / tmp for tmp in fps
]
else:
raise ValueError(
f"The length of fps ({len(fps) if hasattr(fps, '__len__') else fps}) must be equal to the length of video_grid_thw ({len(video_grid_thw)}) or fps should be a single number."
)
videos_inputs.update(
{"second_per_grid_ts": torch.tensor(second_per_grid_ts)}
)
else:
videos_inputs = {}
video_grid_thw = None
if not isinstance(text, list):
text = [text]
if image_grid_thw is not None:
index = 0
for i in range(len(text)):
while self.image_token in text[i]:
text[i] = text[i].replace(
self.image_token,
"<|placeholder|>"
* (
image_grid_thw[index].prod()
// self.image_processor.merge_size
// self.image_processor.merge_size
),
1,
)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)
if video_grid_thw is not None:
index = 0
for i in range(len(text)):
while self.video_token in text[i]:
text[i] = text[i].replace(
self.video_token,
"<|placeholder|>"
* (
video_grid_thw[index].prod()
// self.image_processor.merge_size
// self.image_processor.merge_size
),
1,
)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.video_token)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
def post_process_image_text_to_text(
self,
generated_outputs,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
**kwargs,
):
"""
Post-process the output of the model to decode the text.
Args:
generated_outputs (`torch.Tensor` or `np.ndarray`):
The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
or `(sequence_length,)`.
skip_special_tokens (`bool`, *optional*, defaults to `True`):
Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
**kwargs:
Additional arguments to be passed to the tokenizer's `batch_decode method`.
Returns:
`List[str]`: The decoded text.
"""
return self.tokenizer.batch_decode(
generated_outputs,
skip_special_tokens=skip_special_tokens,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
names_from_processor = list(
dict.fromkeys(tokenizer_input_names + image_processor_input_names)
)
return names_from_processor + ["second_per_grid_ts"]
__all__ = ["PaddleOCRVLProcessor"]
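
Putting the pieces together, a minimal end-to-end sketch (model id, image path, and prompt are assumptions, and the exact chat-template plumbing may differ): the chat template supplies one `<|IMAGE_PLACEHOLDER|>` token, which `__call__` expands to one placeholder per merged patch before tokenization.

```python
# Minimal sketch: chat template + processor expansion of image placeholders.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
messages = [{"role": "user", "content": "OCR:"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image = Image.open("page.png")  # hypothetical local file
inputs = processor(images=image, text=prompt, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["image_grid_thw"])
```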

processor_config.json (new file, 6 lines)

@@ -0,0 +1,6 @@
{
"auto_map": {
"AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
},
"processor_class": "PaddleOCRVLProcessor"
}

special_tokens_map.json (new file, 58 lines)

@@ -0,0 +1,58 @@
{
"additional_special_tokens": [
"<|IMAGE_PLACEHOLDER|>",
"<|image_pad|>",
"<|IMAGE_START|>",
"<|IMAGE_END|>",
"<|video_pad|>"
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"cls_token": {
"content": "<|begin_of_sentence|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": {
"content": "<mask:1>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"sep_token": {
"content": "<|end_of_sentence|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
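
These entries surface as attributes of the loaded tokenizer; a minimal sketch (model id assumed; the image token id shown follows `image_token_id` in config.json):

```python
# Minimal sketch: special tokens become attributes of the loaded tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(tok.cls_token, tok.sep_token)  # <|begin_of_sentence|> <|end_of_sentence|>
print(tok.convert_tokens_to_ids("<|IMAGE_PLACEHOLDER|>"))  # 100295 per config.json
```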

tokenizer.json (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f90f04fd8e5eb6dfa380f37d10c87392de8438dccb6768a2486b5a96ee76dba6
size 11187679

tokenizer.model (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:34ef7db83df785924fb83d7b887b6e822a031c56e15cff40aaf9b982988180df
size 1614363

tokenizer_config.json (new file, 8345 lines)

File diff suppressed because it is too large.