mirror of https://www.modelscope.cn/PaddlePaddle/PaddleOCR-VL.git
synced 2026-04-02 21:42:54 +08:00
# PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/allmetric.png" width="800"/>
</div>
## Introduction
<!-- PaddleOCR-VL decomposes the complex task of document parsing into two stages. The first stage, PP-DocLayoutV2, is responsible for layout analysis: it localizes semantic regions and predicts their reading order. The second stage, PaddleOCR-VL-0.9B, leverages these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs from both stages and formats the final document into structured Markdown and JSON. -->
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/paddleocrvl.png" width="800"/>
</div>
## News
* ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), a multilingual document parsing solution built on a 0.9B ultra-compact vision-language model, delivering SOTA performance.
* ```2025.10.29``` PaddleOCR-VL-0.9B, the core module of PaddleOCR-VL, can now be called via the `transformers` library.
## Usage
**For more usage details and parameter explanations, see the [documentation](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html).**
## PaddleOCR-VL-0.9B Usage with transformers

Currently, we support inference with the PaddleOCR-VL-0.9B model via the `transformers` library, which can recognize text, formula, table, and chart elements. In the future, we plan to support full document parsing inference with `transformers`. Below is a simple script for running PaddleOCR-VL-0.9B with `transformers`.

> [!NOTE]
> We currently recommend using the official method for inference, as it is faster and supports page-level document parsing. The example code below only supports element-level recognition.
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}

model_path = "PaddlePaddle/PaddleOCR-VL"
image_path = "test.png"
image = Image.open(image_path).convert("RGB")

model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPTS[CHOSEN_TASK]},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=1024)
outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(outputs)
```
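Since the script above performs element-level recognition, a full page image should first be cropped to individual layout regions (for example, using boxes predicted by the layout stage, PP-DocLayoutV2) before being passed as `image`. The helper below is a hypothetical sketch of that cropping step using Pillow; it is not part of the official API, and the box format is an assumption:

```python
from PIL import Image


def crop_region(page: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop one layout region (x1, y1, x2, y2), in pixels, out of a page image
    so it can be sent to PaddleOCR-VL-0.9B for element-level recognition."""
    x1, y1, x2, y2 = box
    # Clamp the box to the page bounds to guard against detector overshoot.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(page.width, x2), min(page.height, y2)
    return page.crop((x1, y1, x2, y2))
```

Each cropped region can then replace `image` in the `messages` list above, with the prompt chosen to match the region type (text, table, formula, or chart).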
## Performance

### Page-Level Document Parsing
##### PaddleOCR-VL achieves SOTA performance for overall, text, formula, table, and reading order on OmniDocBench v1.5
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omni15.png" width="800"/>
</div>
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omni10.png" width="800"/>
</div>
PaddleOCR-VL demonstrates robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block performance evaluation.
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omnibenchocr.png" width="800"/>
</div>
In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy, with the lowest edit distances across all evaluated scripts.
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhouseocr.png" width="800"/>
</div>
Our self-built evaluation set contains diverse types of table images, including Chinese, English, and mixed Chinese-English tables with various characteristics: full, partial, or no borders; book/manual formats; lists; academic papers; merged cells; as well as low-quality and watermarked images. PaddleOCR-VL achieves remarkable performance across all categories.
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhousetable.png" width="600"/>
</div>
#### 3. Formula
The In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhouse-formula.png" width="500"/>
</div>
The evaluation set is broadly divided into 11 chart categories: bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhousechart.png" width="400"/>
</div>
### Comprehensive Document Parsing
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview1.jpg" width="600"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview2.jpg" width="600"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview3.jpg" width="600"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview4.jpg" width="600"/>
</div>
### Text
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Table
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/table_01.jpg" width="300" style="display: inline-block;"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/table_02.jpg" width="300" style="display: inline-block;"/>
</div>
### Formula
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
</div>
### Chart
<div align="center">
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
</div>
{%- if not sep_token is defined -%}
{%- set sep_token = "<|end_of_sentence|>" -%}
{%- endif -%}
{%- if not image_token is defined -%}
{%- set image_token = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
{{- "User: " -}}
{%- for content in message["content"] -%}
{%- if content["type"] == "image" -%}
{{ image_token }}
{%- endif -%}
{%- endfor -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] }}
{%- endif -%}
{%- endfor -%}
{{ "\n" -}}
{%- elif message["role"] == "assistant" -%}
{{- "Assistant: " -}}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] + "\n" }}
{%- endif -%}
{%- endfor -%}
{{ sep_token -}}
{%- elif message["role"] == "system" -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] + "\n" }}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
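The updated chat template accepts structured `content` lists (image and text parts) per message instead of a plain string, emitting image placeholders before the text of each user turn. As a rough sketch of the prompt string it produces — assuming the `add_generation_prompt` branch, truncated in the diff above, appends `Assistant: `, and leaving `cls_token` to the tokenizer — the rendering can be mimicked in plain Python:

```python
IMAGE_TOKEN = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>"
SEP_TOKEN = "<|end_of_sentence|>"


def render_prompt(messages, cls_token="", add_generation_prompt=True):
    """Plain-Python mimic of the Jinja chat template above (a sketch, not the
    real renderer): images first, then text, per turn."""
    out = cls_token
    for msg in messages:
        parts = msg["content"]
        texts = [p["text"] for p in parts if p["type"] == "text"]
        if msg["role"] == "user":
            n_images = sum(1 for p in parts if p["type"] == "image")
            out += "User: " + IMAGE_TOKEN * n_images + "".join(texts) + "\n"
        elif msg["role"] == "assistant":
            out += "Assistant: " + "".join(t + "\n" for t in texts) + SEP_TOKEN
        elif msg["role"] == "system":
            out += "".join(t + "\n" for t in texts)
    if add_generation_prompt:
        # Assumed behavior of the truncated branch in the template above.
        out += "Assistant: "
    return out


messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "<pil image>"},
        {"type": "text", "text": "OCR:"},
    ]}
]
print(render_prompt(messages))
# User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:
# Assistant: 
```

This is the same message structure the `transformers` example earlier in the README passes to `processor.apply_chat_template`, which performs the real rendering and tokenization.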