mirror of https://www.modelscope.cn/PaddlePaddle/PaddleOCR-VL.git
synced 2026-04-02 21:42:54 +08:00

Commit: Upload to PaddlePaddle/PaddleOCR-VL on ModelScope hub (README.md)
---
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- ERNIE4.5
- PaddleOCR
- PaddlePaddle
- image-to-text
- ocr
language:
- en
- zh
- multilingual
library_name: PaddleOCR
---

<div align="center">

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

</div>

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/allmetric.png" width="800"/>
</div>

## Introduction
<!-- PaddleOCR-VL decomposes the complex task of document parsing into a two stages. The first stage, PP-DocLayoutV2, is responsible for layout analysis, where it localizes semantic regions and predicts their reading order. Subsequently, the second stage, PaddleOCR-VL-0.9B, leverages these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs from both stages and formats the final document into structured Markdown and JSON. -->

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/paddleocrvl.png" width="800"/>
</div>

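The two-stage design pictured above (layout analysis with reading-order prediction, followed by per-region recognition and aggregation into Markdown) can be sketched as below. All names here (`LayoutRegion`, `detect_layout`, `RECOGNIZERS`, `parse_page`) are illustrative stand-ins, not the actual PaddleOCR API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class LayoutRegion:
    kind: str           # "text", "table", "formula", or "chart"
    reading_order: int  # position predicted by the layout model
    content: str        # stand-in for an image crop; the real pipeline passes pixels

def detect_layout(page: List[LayoutRegion]) -> List[LayoutRegion]:
    # Stage 1 (PP-DocLayoutV2 in the real pipeline): localize semantic
    # regions and order them by the predicted reading order.
    return sorted(page, key=lambda r: r.reading_order)

RECOGNIZERS: Dict[str, Callable[[str], str]] = {
    # Stage 2 (PaddleOCR-VL-0.9B in the real pipeline): fine-grained
    # recognition per region type; here just trivial formatters.
    "text": lambda c: c,
    "table": lambda c: f"<table>{c}</table>",
    "formula": lambda c: f"${c}$",
    "chart": lambda c: f"[chart] {c}",
}

def parse_page(page: List[LayoutRegion]) -> str:
    # Lightweight post-processing: aggregate recognized regions into
    # Markdown-like output in reading order.
    return "\n".join(RECOGNIZERS[r.kind](r.content) for r in detect_layout(page))
```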
Python API usage:

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
```

1. Start the vLLM inference server (see the documentation for the complete `docker run` command and recommended flags):

```bash
docker run \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server
```

2. Call the PaddleOCR CLI or Python API:

```bash
    --vl_rec_backend vllm-server \
    --vl_rec_server_url http://127.0.0.1:8080/v1
```

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")
for res in output:
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
```
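Before pointing the pipeline at a server URL, it can help to verify the endpoint is reachable. A minimal sketch using only the standard library; the `/models` listing path is assumed from the OpenAI-compatible convention and is not confirmed by this README:

```python
from urllib.error import URLError
from urllib.request import urlopen

def server_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    # Probe the (assumed) OpenAI-compatible model listing endpoint,
    # e.g. http://127.0.0.1:8080/v1 -> http://127.0.0.1:8080/v1/models.
    try:
        with urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, or timeout: server not ready.
        return False
```

Used as a guard, e.g. `if server_is_ready("http://127.0.0.1:8080/v1"): ...` before constructing `PaddleOCRVL(vl_rec_backend="vllm-server", ...)`.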

**For more usage details and parameter explanations, see the [documentation](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html).**

## Performance
### Page-Level Document Parsing

#### 1. OmniDocBench v1.5

##### PaddleOCR-VL achieves SOTA performance on overall, text, formula, table, and reading-order metrics on OmniDocBench v1.5

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni15.png" width="800"/>
</div>

#### 2. OmniDocBench v1.0

##### PaddleOCR-VL achieves SOTA performance on almost all metrics (overall, text, formula, table, and reading order) on OmniDocBench v1.0

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omni10.png" width="800"/>
</div>


These results highlight PaddleOCR-VL's robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block evaluation.

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/omnibenchocr.png" width="800"/>
</div>


In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy, with the lowest edit distances in all evaluated scripts.

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouseocr.png" width="800"/>
</div>

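The edit-distance metric referenced above is typically a normalized Levenshtein distance between predicted and ground-truth text. A minimal sketch (the authors' exact evaluation protocol may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance:
    # insertions, deletions, and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    # Normalize by the longer string so scores lie in [0, 1]; lower is better.
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))
```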

Our self-built evaluation set contains diverse table images: Chinese, English, and mixed Chinese-English tables, with characteristics such as full, partial, or no borders; book/manual formats; lists; academic-paper layouts; merged cells; and low-quality or watermarked images. PaddleOCR-VL achieves remarkable performance across all categories.

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousetable.png" width="600"/>
</div>

#### 3. Formula

The In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhouse-formula.png" width="500"/>
</div>


The evaluation set covers 11 chart categories: bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/inhousechart.png" width="400"/>
</div>

### Comprehensive Document Parsing

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview1.jpg" width="600"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview2.jpg" width="600"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview3.jpg" width="600"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/overview4.jpg" width="600"/>
</div>

### Text

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
</div>

### Table

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_01.jpg" width="300" style="display: inline-block;"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/table_02.jpg" width="300" style="display: inline-block;"/>
</div>

### Formula

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
</div>

### Chart

<div align="center">
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
<img src="https://modelscope.cn/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/master/imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
</div>


We would like to thank [ERNIE](https://github.com/PaddlePaddle/ERNIE), [Keye](ht

If you find PaddleOCR-VL helpful, feel free to give us a star and a citation.

```bibtex
@misc{cui2025paddleocrvlboostingmultilingualdocument,
      title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
      author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
      year={2025},
      eprint={2510.14528},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.14528},
}
```