Upload to PaddlePaddle/PaddleOCR-VL on ModelScope hub

TingquanGao
2025-10-21 10:12:45 +00:00
parent 78c582ff62
commit f172aef5ac
18 changed files with 13342 additions and 34 deletions

.gitattributes (vendored, 6 lines changed)

@@ -53,4 +53,8 @@ imgs/overview4.jpg filter=lfs diff=lfs merge=lfs -text
imgs/table_01.jpg filter=lfs diff=lfs merge=lfs -text
PP-DocLayoutV2/inference.pdiparams filter=lfs diff=lfs merge=lfs -text
imgs/text_english_arabic.jpg filter=lfs diff=lfs merge=lfs -text
PP-DocLayoutV2/inference.pdmodel filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.model filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text

README.md (modified)

@@ -3,6 +3,7 @@ license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- ERNIE4.5
+- PaddleOCR
- PaddlePaddle
- image-to-text
- ocr
@@ -16,6 +17,7 @@ language:
- en
- zh
- multilingual
+library_name: PaddleOCR
---
<div align="center">
@@ -42,7 +44,7 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
</div>
<div align="center">
-<img src="./imgs/allmetric.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/allmetric.png" width="800"/>
</div>
## Introduction
@@ -65,7 +67,7 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
<!-- PaddleOCR-VL decomposes the complex task of document parsing into two stages. The first stage, PP-DocLayoutV2, is responsible for layout analysis, where it localizes semantic regions and predicts their reading order. Subsequently, the second stage, PaddleOCR-VL-0.9B, leverages these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs from both stages and formats the final document into structured Markdown and JSON. -->
<div align="center">
-<img src="./imgs/paddleocrvl.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/paddleocrvl.png" width="800"/>
</div>
@@ -100,7 +102,6 @@ Python API usage:
```python
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL()
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
@@ -120,7 +121,6 @@ for res in output:
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server
```
2. Call the PaddleOCR CLI or Python API:
```bash
@@ -129,10 +129,8 @@ for res in output:
--vl_rec_backend vllm-server \
--vl_rec_server_url http://127.0.0.1:8080/v1
```
```python
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
@@ -140,7 +138,8 @@ for res in output:
res.save_to_json(save_path="output")
res.save_to_markdown(save_path="output")
```
+**For more usage details and parameter explanations, see the [documentation](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html).**
## Performance
### Page-Level Document Parsing
@@ -151,19 +150,18 @@
##### PaddleOCR-VL achieves SOTA performance for overall, text, formula, tables and reading order on OmniDocBench v1.5
<div align="center">
-<img src="./imgs/omni15.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni15.png" width="800"/>
</div>
#### 2. OmniDocBench v1.0
##### PaddleOCR-VL achieves SOTA performance for almost all metrics of overall, text, formula, tables and reading order on OmniDocBench v1.0
<div align="center">
-<img src="./imgs/omni10.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni10.png" width="800"/>
</div>
@@ -180,7 +178,7 @@
PaddleOCR-VL's robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block performance evaluation.
<div align="center">
-<img src="./imgs/omnibenchocr.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omnibenchocr.png" width="800"/>
</div>
@@ -189,7 +187,7 @@ PaddleOCR-VL's robust and versatile capability in handling diverse document ty
In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy with the lowest edit distances in all evaluated scripts.
<div align="center">
-<img src="./imgs/inhouseocr.png" width="800"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouseocr.png" width="800"/>
</div>
@@ -201,7 +199,7 @@ In-house-OCR provides an evaluation of performance across multiple languages and
Our self-built evaluation set contains diverse types of table images, such as Chinese, English, mixed Chinese-English, and tables with various characteristics like full, partial, or no borders, book/manual formats, lists, academic papers, merged cells, as well as low-quality, watermarked, etc. PaddleOCR-VL achieves remarkable performance across all categories.
<div align="center">
-<img src="./imgs/inhousetable.png" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousetable.png" width="600"/>
</div>
#### 3. Formula
@@ -211,7 +209,7 @@ Our self-built evaluation set contains diverse types of table images, such as Ch
In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.
<div align="center">
-<img src="./imgs/inhouse-formula.png" width="500"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouse-formula.png" width="500"/>
</div>
@@ -222,7 +220,7 @@ In-house-Formula evaluation set contains simple prints, complex prints, camera s
The evaluation set is broadly categorized into 11 chart categories, including bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.
<div align="center">
-<img src="./imgs/inhousechart.png" width="400"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousechart.png" width="400"/>
</div>
@@ -237,42 +235,42 @@ The evaluation set is broadly categorized into 11 chart categories, including ba
### Comprehensive Document Parsing
<div align="center">
-<img src="./imgs/overview1.jpg" width="600"/>
-<img src="./imgs/overview2.jpg" width="600"/>
-<img src="./imgs/overview3.jpg" width="600"/>
-<img src="./imgs/overview4.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview1.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview2.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview3.jpg" width="600"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview4.jpg" width="600"/>
</div>
### Text
<div align="center">
-<img src="./imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Table
<div align="center">
-<img src="./imgs/table_01.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/table_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Formula
<div align="center">
-<img src="./imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
</div>
### Chart
<div align="center">
-<img src="./imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
-<img src="./imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
</div>
@@ -285,11 +283,13 @@ We would like to thank [ERNIE](https://github.com/PaddlePaddle/ERNIE), [Keye](ht
If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
```bibtex
-@misc{paddleocrvl2025technicalreport,
+@misc{cui2025paddleocrvlboostingmultilingualdocument,
title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
-author={Cui, C. et al.},
+author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
year={2025},
-primaryClass={cs.CL},
-howpublished={\url{https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf}}
+eprint={2510.14528},
+archivePrefix={arXiv},
+primaryClass={cs.CV},
+url={https://arxiv.org/abs/2510.14528},
}
```

added_tokens.json (new file, 1021 lines)

File diff suppressed because it is too large.

chat_template.jinja (new file, 22 lines)

@@ -0,0 +1,22 @@
{%- if not add_generation_prompt is defined -%}
{%- set add_generation_prompt = true -%}
{%- endif -%}
{%- if not cls_token is defined -%}
{%- set cls_token = "<|begin_of_sentence|>" -%}
{%- endif -%}
{%- if not sep_token is defined -%}
{%- set sep_token = "<|end_of_sentence|>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
{{- "User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" + message["content"] + "\n" -}}
{%- elif message["role"] == "assistant" -%}
{{- "Assistant: " + message["content"] + sep_token -}}
{%- elif message["role"] == "system" -%}
{{- message["content"] -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- "Assistant: " -}}
{%- endif -%}
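
As a sanity check, the template can be rendered standalone; a minimal sketch with jinja2, assuming a local copy of the file (the user text "OCR:" is an arbitrary example):

```python
# Minimal sketch: render chat_template.jinja directly with jinja2.
from jinja2 import Template

template = Template(open("chat_template.jinja").read())
prompt = template.render(
    messages=[{"role": "user", "content": "OCR:"}],
    add_generation_prompt=True,
)
print(repr(prompt))
# '<|begin_of_sentence|>User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:\nAssistant: '
```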

config.json (new file, 75 lines)

@@ -0,0 +1,75 @@
{
"architectures": [
"PaddleOCRVLForConditionalGeneration"
],
"attention_probs_dropout_prob": 0.0,
"auto_map": {
"AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
"AutoModel": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration",
"AutoModelForCausalLM": "modeling_paddleocr_vl.PaddleOCRVLForConditionalGeneration"
},
"compression_ratio": 1.0,
"head_dim": 128,
"hidden_act": "silu",
"hidden_dropout_prob": 0.0,
"hidden_size": 1024,
"ignored_index": -100,
"image_token_id": 100295,
"intermediate_size": 3072,
"max_position_embeddings": 131072,
"max_sequence_length": null,
"model_type": "paddleocr_vl",
"num_attention_heads": 16,
"num_hidden_layers": 18,
"num_key_value_heads": 2,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 500000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.55.0",
"use_bias": false,
"use_cache": false,
"use_flash_attention": false,
"video_token_id": 101307,
"vision_config": {
"architectures": [
"SiglipVisionModel"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_paddleocr_vl.PaddleOCRVLConfig",
"AutoModel": "modeling_paddleocr_vl.SiglipVisionModel"
},
"hidden_act": "gelu_pytorch_tanh",
"hidden_size": 1152,
"image_size": 384,
"intermediate_size": 4304,
"layer_norm_eps": 1e-06,
"model_type": "paddleocr_vl",
"num_attention_heads": 16,
"num_channels": 3,
"num_hidden_layers": 27,
"pad_token_id": 0,
"patch_size": 14,
"spatial_merge_size": 2,
"temporal_patch_size": 2,
"tokens_per_second": 2,
"torch_dtype": "bfloat16"
},
"vision_start_token_id": 101305,
"vocab_size": 103424,
"weight_share_add_bias": true,
"use_3d_rope": true,
"rope_is_neox_style": true
}
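
The `auto_map` above routes the Auto classes to the remote code in this repo. A minimal sketch of loading the config and reading a few of these fields (the model id is an assumption; `trust_remote_code` executes configuration_paddleocr_vl.py):

```python
# Minimal sketch: load the remote-code config and inspect key fields.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(cfg.model_type)                          # paddleocr_vl
print(cfg.hidden_size, cfg.num_hidden_layers)  # 1024 18 (0.9B-scale decoder)
print(cfg.vision_config.hidden_size)           # 1152 (SigLIP-style encoder)
print(cfg.rope_scaling["mrope_section"])       # [16, 24, 24]
```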

configuration_paddleocr_vl.py (new file, 191 lines)

@@ -0,0 +1,191 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation
class PaddleOCRVisionConfig(PretrainedConfig):
model_type = "paddleocr_vl"
base_config_key = "vision_config"
def __init__(
self,
hidden_size=768,
intermediate_size=3072,
num_hidden_layers=12,
num_attention_heads=12,
num_channels=3,
image_size=224,
patch_size=14,
hidden_act="gelu_pytorch_tanh",
layer_norm_eps=1e-6,
attention_dropout=0.0,
spatial_merge_size=2,
temporal_patch_size=2,
tokens_per_second=2,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.spatial_merge_size = spatial_merge_size
self.temporal_patch_size = temporal_patch_size
self.tokens_per_second = tokens_per_second
class PaddleOCRVLConfig(PretrainedConfig):
"""
Configuration class.
This class stores the configuration of an Ernie model, defining the model architecture.
It inherits from PretrainedConfig and can be used to control model outputs.
"""
model_type = "paddleocr_vl"
keys_to_ignore_at_inference = ["past_key_values"]
sub_configs = {"vision_config": PaddleOCRVisionConfig}
# Default tensor parallel plan, following the Qwen3-style attention/MLP projection layout
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
}
base_model_pp_plan = {
"embed_tokens": (["input_ids"], ["inputs_embeds"]),
"layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
"norm": (["hidden_states"], ["hidden_states"]),
}
def __init__(
self,
vocab_size=32000,
hidden_size=768,
intermediate_size=11008,
max_position_embeddings=32768,
num_hidden_layers=2,
num_attention_heads=2,
image_token_id=101304,
video_token_id=101305,
vision_start_token_id=101306,
rms_norm_eps=1e-6,
use_cache=False,
use_flash_attention=False,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
head_dim=128,
hidden_act="silu",
use_bias=False,
rope_theta=10000,
weight_share_add_bias=True,
ignored_index=-100,
attention_probs_dropout_prob=0.0,
hidden_dropout_prob=0.0,
compression_ratio: float = 1.0,
num_key_value_heads=None,
max_sequence_length=None,
tie_word_embeddings=False,
vision_config=None,
rope_scaling=None,
**kwargs,
):
"""
Initialize configuration with default or specified parameters.
Args:
vocab_size (int): Size of the vocabulary (number of unique tokens)
hidden_size (int): Dimensionality of the encoder layers and the pooler layer
intermediate_size (int): Dimensionality of the "intermediate" (feed-forward) layer
max_position_embeddings (int): Maximum sequence length the model can handle
num_hidden_layers (int): Number of hidden layers in the Transformer encoder
num_attention_heads (int): Number of attention heads for each attention layer
rms_norm_eps (float): The epsilon used by the RMS normalization layers
use_cache (bool): Whether to use caching for faster generation (decoding)
use_flash_attention (bool): Whether to use FlashAttention for optimized attention computation
pad_token_id (int): Token ID used for padding sequences
bos_token_id (int): Token ID used for beginning-of-sequence
eos_token_id (int): Token ID used for end-of-sequence
use_bias (bool): Whether to use bias terms in linear layers
rope_theta (float): The base period of the RoPE embeddings
weight_share_add_bias (bool): Whether to share bias weights in certain layers
ignored_index (int): Target value that is ignored during loss computation
attention_probs_dropout_prob (float): Dropout probability for attention weights
hidden_dropout_prob (float): Dropout probability for hidden layers
compression_ratio (float): Ratio for KV cache compression (1.0 = no compression)
num_key_value_heads (int): Number of key/value heads (for Grouped Query Attention)
max_sequence_length (int): Maximum sequence length for positional embeddings
**kwargs: Additional keyword arguments passed to parent class
"""
# Set default for tied embeddings if not specified.
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
if isinstance(vision_config, dict):
self.vision_config = self.sub_configs["vision_config"](**vision_config)
elif vision_config is None:
self.vision_config = self.sub_configs["vision_config"]()
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.max_position_embeddings = max_position_embeddings
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.use_flash_attention = use_flash_attention
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
self.image_token_id = image_token_id
self.video_token_id = video_token_id
self.vision_start_token_id = vision_start_token_id
self.head_dim = head_dim
self.hidden_act = hidden_act
self.sliding_window = None
self.use_bias = use_bias
self.weight_share_add_bias = weight_share_add_bias
self.rope_theta = rope_theta
self.ignored_index = ignored_index
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.hidden_dropout_prob = hidden_dropout_prob
self.compression_ratio = compression_ratio
self.num_key_value_heads = num_key_value_heads
self.max_sequence_length = max_sequence_length
self.rope_scaling = rope_scaling
if self.rope_scaling is not None and "type" in self.rope_scaling:
if self.rope_scaling["type"] == "mrope":
self.rope_scaling["type"] = "default"
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
rope_config_validation(self, ignore_keys={"mrope_section"})
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
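
The `rope_scaling` handling at the end of `__init__` deserves a worked example: a config created with type `"mrope"` is normalized to `"default"` while `mrope_section` is preserved for multimodal position splitting. A minimal sketch, assuming this file is importable from the checkout as `configuration_paddleocr_vl`:

```python
# Minimal sketch: "mrope" is rewritten to "default"; mrope_section survives.
from configuration_paddleocr_vl import PaddleOCRVLConfig

cfg = PaddleOCRVLConfig(rope_scaling={"type": "mrope", "mrope_section": [16, 24, 24]})
print(cfg.rope_scaling)
# {'type': 'default', 'mrope_section': [16, 24, 24], 'rope_type': 'default'}
```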

generation_config.json (new file, 6 lines)

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"eos_token_id": 2,
"transformers_version": "4.55.0",
"use_cache": false
}
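
These defaults travel with the checkpoint; a minimal sketch of reading them back (the model id is an assumption):

```python
# Minimal sketch: inspect the generation defaults shipped with the model.
from transformers import GenerationConfig

gen = GenerationConfig.from_pretrained("PaddlePaddle/PaddleOCR-VL")
print(gen.eos_token_id, gen.use_cache)  # 2 False
```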

image_processing.py (new file, 569 lines)

@@ -0,0 +1,569 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for PaddleOCR-VL."""
import math
from typing import Dict, List, Optional, Union
import numpy as np
import torch
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from torchvision.transforms import functional as TF
from transformers.image_transforms import (
convert_to_rgb,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
PILImageResampling,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
is_valid_image,
make_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from transformers.utils import TensorType, is_vision_available, logging
logger = logging.get_logger(__name__)
if is_vision_available():
from PIL import Image
ImageInput = Union[
"PIL.Image.Image",
np.ndarray,
"torch.Tensor",
List["PIL.Image.Image"],
List[np.ndarray],
List["torch.Tensor"],
] # noqa
VideoInput = Union[
List["PIL.Image.Image"],
"np.ndarray",
"torch.Tensor",
List["np.ndarray"],
List["torch.Tensor"],
List[List["PIL.Image.Image"]],
List[List["np.ndarray"]],
List[List["torch.Tensor"]],
] # noqa
def make_batched_images(images) -> List[List[ImageInput]]:
"""
Accepts images in list or nested list format, and makes a list of images for preprocessing.
Args:
images (`Union[List[List[ImageInput]], List[ImageInput], ImageInput]`):
The input image.
Returns:
list: A list of images.
"""
if (
isinstance(images, (list, tuple))
and isinstance(images[0], (list, tuple))
and is_valid_image(images[0][0])
):
return [img for img_list in images for img in img_list]
elif isinstance(images, (list, tuple)) and is_valid_image(images[0]):
return images
elif is_valid_image(images):
return [images]
raise ValueError(f"Could not make batched images from {images}")
def adjust_size(size, patch_size):
num_patches = size // patch_size
if num_patches % 2 != 0:  # if the patch count is odd, drop one to keep it even
num_patches -= 1
return num_patches * patch_size
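# Worked example (illustrative): adjust_size(300, 14) -> 300 // 14 = 21 patches,
# 21 is odd so it drops to 20, giving 20 * 14 = 280 pixels.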
def make_batched_videos(videos) -> List[VideoInput]:
if (
isinstance(videos, (list, tuple))
and isinstance(videos[0], (list, tuple))
and is_valid_image(videos[0][0])
):
return videos
elif isinstance(videos, (list, tuple)) and is_valid_image(videos[0]):
if isinstance(videos[0], Image.Image):
return [videos]
elif len(videos[0].shape) == 4:
return [list(video) for video in videos]
elif is_valid_image(videos) and len(videos.shape) == 4:
return [list(videos)]
raise ValueError(f"Could not make batched video from {videos}")
def smart_resize(
height: int,
width: int,
factor: int = 28,
min_pixels: int = 28 * 28 * 130,
max_pixels: int = 28 * 28 * 1280,
):
"""Rescales the image so that the following conditions are met:
1. Both dimensions (height and width) are divisible by 'factor'.
2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
3. The aspect ratio of the image is maintained as closely as possible.
"""
# if height < factor or width < factor:
# raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
# if int(height < factor//4) + int(width < factor//4):
# raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor//4}")
if height < factor:
print(f"smart_resize: height={height} < factor={factor}, reset height=factor")
width = round((width * factor) / height)
height = factor
if width < factor:
print(f"smart_resize: width={width} < factor={factor}, reset width=factor")
height = round((height * factor) / width)
width = factor
if max(height, width) / min(height, width) > 200:
raise ValueError(
f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
)
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
if h_bar * w_bar > max_pixels:
beta = math.sqrt((height * width) / max_pixels)
h_bar = math.floor(height / beta / factor) * factor
w_bar = math.floor(width / beta / factor) * factor
elif h_bar * w_bar < min_pixels:
beta = math.sqrt(min_pixels / (height * width))
h_bar = math.ceil(height * beta / factor) * factor
w_bar = math.ceil(width * beta / factor) * factor
return h_bar, w_bar
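# Worked example (illustrative): smart_resize(1000, 1400, factor=28) first rounds
# to 1008 x 1400 = 1,411,200 px, which exceeds max_pixels = 28*28*1280 = 1,003,520,
# so beta = sqrt(1,400,000 / 1,003,520) ~ 1.181 and the result floors to 840 x 1176.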
class SiglipImageProcessor(BaseImageProcessor):
r"""
Constructs a Siglip image processor that dynamically resizes images based on the original images.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use when resizing the image.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
min_pixels (`int`, *optional*, defaults to `28 * 28 * 130`):
The min pixels of the image to resize the image.
max_pixels (`int`, *optional*, defaults to `28 * 28 * 1670`):
The max pixels of the image to resize the image.
patch_size (`int`, *optional*, defaults to 14):
The spatial patch size of the vision encoder.
temporal_patch_size (`int`, *optional*, defaults to 2):
The temporal patch size of the vision encoder.
merge_size (`int`, *optional*, defaults to 2):
The merge size of the vision encoder to llm encoder.
"""
model_input_names = [
"pixel_values",
"image_grid_thw",
"pixel_values_videos",
"video_grid_thw",
]
def __init__(
self,
do_resize: bool = True,
resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = True,
min_pixels: int = 28 * 28 * 130,
max_pixels: int = 28 * 28 * 1280,
patch_size: int = 14,
temporal_patch_size: int = 1,
merge_size: int = 2,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.do_resize = do_resize
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.min_pixels = min_pixels
self.max_pixels = max_pixels
self.patch_size = patch_size
self.temporal_patch_size = temporal_patch_size
self.merge_size = merge_size
self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels} # not used
self.do_convert_rgb = do_convert_rgb
def mvit_rescale(self, image: Image.Image, merge_size: int = 2) -> Image.Image:
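# NOTE: this helper reads self.in_token_limit and self.pad_input, which are not
# set in __init__; callers are expected to assign them before invoking it.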
try:
w, h = image.size
except Exception:
raise ValueError(str((type(image), image)))
patch_size = self.patch_size
if (w // patch_size) * (h // patch_size) > self.in_token_limit:
scale = math.sqrt(
self.in_token_limit / ((w // patch_size) * (h // patch_size))
)
new_w, new_h = int(w * scale), int(h * scale)
image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
if self.pad_input:
new_w, new_h = image.size
pad_size_h = merge_size * patch_size
pad_size_w = merge_size * patch_size
pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w
image = TF.pad(image, (0, 0, pad_w, pad_h))
else:
new_w, new_h = image.size
new_w = new_w - new_w % patch_size
new_h = new_h - new_h % patch_size
new_w = adjust_size(new_w, patch_size)
new_h = adjust_size(new_h, patch_size)
image = TF.center_crop(image, (new_h, new_w))
w, h = image.size
if w // patch_size >= 512 or h // patch_size >= 512:
new_h = min(patch_size * 510, h)
new_w = min(patch_size * 510, w)
image = TF.center_crop(image, (new_h, new_w))
# raise ValueError("Exceed pos emb")
return image
def _preprocess(
self,
images: Union[ImageInput, VideoInput],
do_resize: bool = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.
Args:
images (`ImageInput`):
Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
vision_info (`List[Dict]`, *optional*):
Optional list of dictionaries containing additional information about vision inputs.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format. - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
images = make_list_of_images(images)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
height, width = get_image_size(images[0], channel_dim=input_data_format)
resized_height, resized_width = height, width
processed_images = []
for image in images:
if do_resize:
resized_height, resized_width = smart_resize(
height,
width,
factor=self.patch_size * self.merge_size,
min_pixels=self.min_pixels,
max_pixels=self.max_pixels,
)
image = resize(
image,
size=(resized_height, resized_width),
resample=resample,
input_data_format=input_data_format,
)
if do_rescale:
image = self.rescale(
image, scale=rescale_factor, input_data_format=input_data_format
)
if do_normalize:
image = self.normalize(
image=image,
mean=image_mean,
std=image_std,
input_data_format=input_data_format,
)
image = to_channel_dimension_format(
image, data_format, input_channel_dim=input_data_format
)
processed_images.append(image)
patches = np.array(processed_images)
if data_format == ChannelDimension.LAST:
patches = patches.transpose(0, 3, 1, 2)
if patches.shape[0] == 1:
patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))
init_patches = patches
channel = patches.shape[1]
grid_t = patches.shape[0] // self.temporal_patch_size
grid_h, grid_w = (
resized_height // self.patch_size,
resized_width // self.patch_size,
)
patches = patches.reshape(
grid_t,
self.temporal_patch_size,
channel,
grid_h,
self.patch_size,
grid_w,
self.patch_size,
)
patches = patches.transpose(0, 3, 5, 2, 1, 4, 6)
assert self.temporal_patch_size == 1
flatten_patches = patches.reshape(
grid_t * grid_h * grid_w, channel, self.patch_size, self.patch_size
)
return flatten_patches, (grid_t, grid_h, grid_w)
def preprocess(
self,
images: ImageInput,
videos: VideoInput = None,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
videos (`VideoInput`):
Video to preprocess. Expects a single or batch of videos with pixel values ranging from 0 to 255. If
passing in videos with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
the longest edge resized to keep the input aspect ratio.
resample (`int`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = (
rescale_factor if rescale_factor is not None else self.rescale_factor
)
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = (
do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
)
if images is not None:
images = make_batched_images(images)
if videos is not None:
videos = make_batched_videos(videos)
if images is not None and not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
if images is not None:
pixel_values, vision_grid_thws = [], []
for image in images:
patches, image_grid_thw = self._preprocess(
image,
do_resize=do_resize,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
do_convert_rgb=do_convert_rgb,
input_data_format=input_data_format,
)
pixel_values.extend(patches)
vision_grid_thws.append(image_grid_thw)
pixel_values = np.array(pixel_values)
vision_grid_thws = np.array(vision_grid_thws)
data = {"pixel_values": pixel_values, "image_grid_thw": vision_grid_thws}
if videos is not None:
pixel_values, vision_grid_thws = [], []
for images in videos:
patches, video_grid_thw = self._preprocess(
images,
do_resize=do_resize,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
do_convert_rgb=do_convert_rgb,
input_data_format=input_data_format,
)
pixel_values.extend(patches)
vision_grid_thws.append(video_grid_thw)
pixel_values = np.array(pixel_values)
vision_grid_thws = np.array(vision_grid_thws)
data = {
"pixel_values_videos": pixel_values,
"video_grid_thw": vision_grid_thws,
}
return BatchFeature(data=data, tensor_type=return_tensors)
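
As a usage sketch (synthetic input; the class and its defaults are as defined above), preprocessing one image yields flattened 14x14 patches plus the (t, h, w) grid that the language model later uses to expand image placeholder tokens:

```python
# Minimal sketch: run the processor on a synthetic 560x840 RGB image.
import numpy as np
from PIL import Image

processor = SiglipImageProcessor(image_mean=[0.5, 0.5, 0.5], image_std=[0.5, 0.5, 0.5])
image = Image.fromarray(np.random.randint(0, 256, (560, 840, 3), dtype=np.uint8))
batch = processor.preprocess(images=image, return_tensors="pt")
print(batch["pixel_values"].shape)  # torch.Size([2400, 3, 14, 14]) for this input
print(batch["image_grid_thw"])      # tensor([[ 1, 40, 60]])
```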

inference.yml (new file, 2 lines)

@@ -0,0 +1,2 @@
Global:
model_name: PaddleOCR-VL-0.9B

model.safetensors (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3085f1042e184f68f8a412aa0f64f2c4b8562989598bbfba326aaa11fc685de8
size 1917255968

modeling_paddleocr_vl.py (new file, 2674 lines)

File diff suppressed because it is too large.

preprocessor_config.json (new file, 33 lines)

@@ -0,0 +1,33 @@
{
"auto_map": {
"AutoImageProcessor": "image_processing.SiglipImageProcessor",
"AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
},
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "SiglipImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"max_pixels": 2822400,
"merge_size": 2,
"min_pixels": 147384,
"patch_size": 14,
"processor_class": "PaddleOCRVLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"max_pixels": 2822400,
"min_pixels": 147384
},
"temporal_patch_size": 1
}
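
The `auto_map` block wires transformers' Auto classes to the custom files above; a minimal sketch (the model id is an assumption):

```python
# Minimal sketch: auto_map resolves AutoImageProcessor to image_processing.py.
from transformers import AutoImageProcessor

ip = AutoImageProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(type(ip).__name__)             # SiglipImageProcessor
print(ip.min_pixels, ip.max_pixels)  # 147384 2822400
```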

processing_paddleocr_vl.py (new file, 293 lines)

@@ -0,0 +1,293 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Union
import numpy as np
import torch
from transformers.feature_extraction_utils import BatchFeature
from transformers.processing_utils import (
ProcessingKwargs,
ProcessorMixin,
Unpack,
VideosKwargs,
)
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
ImageInput = Union[
"PIL.Image.Image",
np.ndarray,
"torch.Tensor",
List["PIL.Image.Image"],
List[np.ndarray],
List["torch.Tensor"],
] # noqa
VideoInput = Union[
List["PIL.Image.Image"],
"np.ndarray",
"torch.Tensor",
List["np.ndarray"],
List["torch.Tensor"],
List[List["PIL.Image.Image"]],
List[List["np.ndarray"]],
List[List["torch.Tensor"]],
] # noqa
class PaddleOCRVLVideosProcessorKwargs(VideosKwargs, total=False):
fps: Union[List[float], float]
class PaddleOCRVLProcessorKwargs(ProcessingKwargs, total=False):
videos_kwargs: PaddleOCRVLVideosProcessorKwargs
_defaults = {
"text_kwargs": {
"padding": False,
},
"videos_kwargs": {"fps": 2.0},
}
class PaddleOCRVLProcessor(ProcessorMixin):
r"""
[`PaddleOCRVLProcessor`] offers all the functionalities of [`SiglipImageProcessor`] and [`Qwen2TokenizerFast`]. See the
[`~PaddleOCRVLProcessor.__call__`] and [`~PaddleOCRVLProcessor.decode`] for more information.
Args:
image_processor ([`SiglipImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`Qwen2TokenizerFast`], *optional*):
The tokenizer is a required input.
chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = [
"chat_template",
"image_std",
"min_pixels",
"image_mean",
"merge_size",
"image_processor_type",
"temporal_patch_size",
"patch_size",
"max_pixels",
]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self, image_processor=None, tokenizer=None, chat_template=None, **kwargs
):
self.image_token = (
"<|IMAGE_PLACEHOLDER|>"
if not hasattr(tokenizer, "image_token")
else tokenizer.image_token
)
self.video_token = (
"<|video_pad|>"
if not hasattr(tokenizer, "video_token")
else tokenizer.video_token
)
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
images: ImageInput = None,
text: Union[
TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]
] = None,
videos: VideoInput = None,
**kwargs: Unpack[PaddleOCRVLProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwargs` arguments to
SiglipImageProcessor's [`~SiglipImageProcessor.__call__`] if `vision_infos` is not `None`.
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
- **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
- **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
- **second_per_grid_ts** -- List of video seconds per time grid. Returned when `videos` is not `None`.
"""
output_kwargs = self._merge_kwargs(
PaddleOCRVLProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images=images, return_tensors="pt")
image_grid_thw = image_inputs["image_grid_thw"]
else:
image_inputs = {}
image_grid_thw = None
if videos is not None:
# TODO: add video processing
videos_inputs = self.image_processor(
images=None, videos=videos, **output_kwargs["images_kwargs"]
)
video_grid_thw = videos_inputs["video_grid_thw"]
fps = output_kwargs["videos_kwargs"].pop("fps", 2.0)
if isinstance(fps, (int, float)):
second_per_grid_ts = [
self.image_processor.temporal_patch_size / fps
] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
second_per_grid_ts = [
self.image_processor.temporal_patch_size / tmp for tmp in fps
]
else:
raise ValueError(
f"The length of fps ({len(fps) if hasattr(fps, '__len__') else fps}) must be equal to the length of video_grid_thw ({len(video_grid_thw)}) or fps should be a single number."
)
videos_inputs.update(
{"second_per_grid_ts": torch.tensor(second_per_grid_ts)}
)
else:
videos_inputs = {}
video_grid_thw = None
if not isinstance(text, list):
text = [text]
if image_grid_thw is not None:
index = 0
for i in range(len(text)):
while self.image_token in text[i]:
text[i] = text[i].replace(
self.image_token,
"<|placeholder|>"
* (
image_grid_thw[index].prod()
// self.image_processor.merge_size
// self.image_processor.merge_size
),
1,
)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)
if video_grid_thw is not None:
index = 0
for i in range(len(text)):
while self.video_token in text[i]:
text[i] = text[i].replace(
self.video_token,
"<|placeholder|>"
* (
video_grid_thw[index].prod()
// self.image_processor.merge_size
// self.image_processor.merge_size
),
1,
)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.video_token)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs})
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
def post_process_image_text_to_text(
self,
generated_outputs,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
**kwargs,
):
"""
Post-process the output of the model to decode the text.
Args:
generated_outputs (`torch.Tensor` or `np.ndarray`):
The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
or `(sequence_length,)`.
skip_special_tokens (`bool`, *optional*, defaults to `True`):
Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
**kwargs:
Additional arguments to be passed to the tokenizer's `batch_decode method`.
Returns:
`List[str]`: The decoded text.
"""
return self.tokenizer.batch_decode(
generated_outputs,
skip_special_tokens=skip_special_tokens,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
names_from_processor = list(
dict.fromkeys(tokenizer_input_names + image_processor_input_names)
)
return names_from_processor + ["second_per_grid_ts"]
__all__ = ["PaddleOCRVLProcessor"]
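
Putting the pieces together, a minimal end-to-end sketch (model id, image path, and prompt are assumptions, and the exact chat-template plumbing may differ): the chat template supplies one `<|IMAGE_PLACEHOLDER|>` token, which `__call__` expands to one placeholder per merged patch before tokenization.

```python
# Minimal sketch: chat template + processor expansion of image placeholders.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
messages = [{"role": "user", "content": "OCR:"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image = Image.open("page.png")  # hypothetical local file
inputs = processor(images=image, text=prompt, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["image_grid_thw"])
```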

processor_config.json (new file, 6 lines)

@@ -0,0 +1,6 @@
{
"auto_map": {
"AutoProcessor": "processing_paddleocr_vl.PaddleOCRVLProcessor"
},
"processor_class": "PaddleOCRVLProcessor"
}

special_tokens_map.json (new file, 58 lines)

@@ -0,0 +1,58 @@
{
"additional_special_tokens": [
"<|IMAGE_PLACEHOLDER|>",
"<|image_pad|>",
"<|IMAGE_START|>",
"<|IMAGE_END|>",
"<|video_pad|>"
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"cls_token": {
"content": "<|begin_of_sentence|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"mask_token": {
"content": "<mask:1>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"sep_token": {
"content": "<|end_of_sentence|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
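
These entries surface as attributes of the loaded tokenizer; a minimal sketch (model id assumed; the image token id shown follows `image_token_id` in config.json):

```python
# Minimal sketch: special tokens become attributes of the loaded tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
print(tok.cls_token, tok.sep_token)  # <|begin_of_sentence|> <|end_of_sentence|>
print(tok.convert_tokens_to_ids("<|IMAGE_PLACEHOLDER|>"))  # 100295 per config.json
```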

tokenizer.json (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f90f04fd8e5eb6dfa380f37d10c87392de8438dccb6768a2486b5a96ee76dba6
size 11187679

tokenizer.model (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:34ef7db83df785924fb83d7b887b6e822a031c56e15cff40aaf9b982988180df
size 1614363

tokenizer_config.json (new file, 8345 lines)

File diff suppressed because it is too large.