diff --git a/README.md b/README.md
index 502b9eb..52b7b95 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
## Introduction
@@ -67,12 +67,13 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
## News

* ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), a multilingual document parsing solution powered by a 0.9B Ultra-Compact Vision-Language Model, with SOTA performance.
+* ```2025.10.29``` PaddleOCR-VL-0.9B, the core module of PaddleOCR-VL, can now be called via the `transformers` library.

## Usage
@@ -140,6 +141,59 @@ for res in output:
```
**For more usage details and parameter explanations, see the [documentation](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html).**
+
+## PaddleOCR-VL-0.9B Usage with transformers
+
+We currently support inference with the PaddleOCR-VL-0.9B model via the `transformers` library; it can recognize text, formulas, tables, and chart elements. Support for full document parsing with `transformers` is planned. Below is a simple script for running PaddleOCR-VL-0.9B inference with `transformers`.
+
+> [!NOTE]
+> We currently recommend the official method for inference, as it is faster and supports page-level document parsing. The example code below supports element-level recognition only.
+
+```python
+from PIL import Image
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor
+
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
+PROMPTS = {
+    "ocr": "OCR:",
+    "table": "Table Recognition:",
+    "formula": "Formula Recognition:",
+    "chart": "Chart Recognition:",
+}
+
+model_path = "PaddlePaddle/PaddleOCR-VL"
+image_path = "test.png"
+image = Image.open(image_path).convert("RGB")
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
+).to(DEVICE).eval()
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+messages = [
+    {"role": "user",
+     "content": [
+         {"type": "image", "image": image},
+         {"type": "text", "text": PROMPTS[CHOSEN_TASK]},
+     ]
+    }
+]
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(DEVICE)
+
+outputs = model.generate(**inputs, max_new_tokens=1024)
+outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+print(outputs)
+```
+
## Performance
### Page-Level Document Parsing
@@ -150,7 +204,7 @@ for res in output:
##### PaddleOCR-VL achieves SOTA performance for overall, text, formula, tables and reading order on OmniDocBench v1.5
@@ -161,7 +215,7 @@ for res in output:
@@ -178,7 +232,7 @@ for res in output:
PaddleOCR-VL’s robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block performance evaluation.
@@ -187,7 +241,7 @@ PaddleOCR-VL’s robust and versatile capability in handling diverse document ty
In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy, with the lowest edit distances in all evaluated scripts.
@@ -199,7 +253,7 @@ In-house-OCR provides an evaluation of performance across multiple languages and
Our self-built evaluation set contains diverse types of table images, including Chinese, English, and mixed Chinese-English tables, with characteristics such as full, partial, or no borders; book/manual formats; lists; academic papers; merged cells; and low-quality or watermarked images. PaddleOCR-VL achieves remarkable performance across all categories.
#### 3. Formula
@@ -209,7 +263,7 @@ Our self-built evaluation set contains diverse types of table images, such as Ch
The In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.
@@ -220,7 +274,7 @@ In-house-Formula evaluation set contains simple prints, complex prints, camera s
The evaluation set is broadly divided into 11 chart categories: bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.
@@ -235,42 +289,42 @@ The evaluation set is broadly categorized into 11 chart categories, including ba
### Comprehensive Document Parsing
### Text
### Table
### Formula
### Chart
@@ -292,4 +346,4 @@ If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.14528},
}
-```
+```
\ No newline at end of file
diff --git a/chat_template.jinja b/chat_template.jinja
index f92b066..116312d 100644
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -7,14 +7,38 @@
{%- if not sep_token is defined -%}
{%- set sep_token = "<|end_of_sentence|>" -%}
{%- endif -%}
+{%- if not image_token is defined -%}
+{%- set image_token = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" -%}
+{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
-{{- "User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" + message["content"] + "\n" -}}
+{{- "User: " -}}
+{%- for content in message["content"] -%}
+{%- if content["type"] == "image" -%}
+{{ image_token }}
+{%- endif -%}
+{%- endfor -%}
+{%- for content in message["content"] -%}
+{%- if content["type"] == "text" -%}
+{{ content["text"] }}
+{%- endif -%}
+{%- endfor -%}
+{{ "\n" -}}
{%- elif message["role"] == "assistant" -%}
-{{- "Assistant: " + message["content"] + sep_token -}}
+{{- "Assistant: " -}}
+{%- for content in message["content"] -%}
+{%- if content["type"] == "text" -%}
+{{ content["text"] + "\n" }}
+{%- endif -%}
+{%- endfor -%}
+{{ sep_token -}}
{%- elif message["role"] == "system" -%}
-{{- message["content"] -}}
+{%- for content in message["content"] -%}
+{%- if content["type"] == "text" -%}
+{{ content["text"] + "\n" }}
+{%- endif -%}
+{%- endfor -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
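The chat_template.jinja change above switches `message["content"]` from a plain string to a list of typed parts (`{"type": "image", ...}` / `{"type": "text", ...}`), rendering all image tokens before the text within each user turn. As a rough sketch of what the updated template produces, here is the same logic in plain Python (the trailing `Assistant: ` generation prompt and the empty `cls_token` are assumptions, since those branches fall outside the hunk):

```python
# Sketch of the prompt string the updated chat template renders for
# structured messages. Token strings mirror the template defaults; the
# generation-prompt text "Assistant: " is an assumption.
IMAGE_TOKEN = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>"
SEP_TOKEN = "<|end_of_sentence|>"

def render(messages, add_generation_prompt=True, cls_token=""):
    parts = [cls_token]
    for message in messages:
        if message["role"] == "user":
            parts.append("User: ")
            # Images first, then text, matching the template's two loops.
            parts += [IMAGE_TOKEN for c in message["content"] if c["type"] == "image"]
            parts += [c["text"] for c in message["content"] if c["type"] == "text"]
            parts.append("\n")
        elif message["role"] == "assistant":
            parts.append("Assistant: ")
            parts += [c["text"] + "\n" for c in message["content"] if c["type"] == "text"]
            parts.append(SEP_TOKEN)
    if add_generation_prompt:
        parts.append("Assistant: ")
    return "".join(parts)

messages = [{"role": "user",
             "content": [{"type": "image", "image": "test.png"},
                         {"type": "text", "text": "OCR:"}]}]
print(render(messages))
# → User: <|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>OCR:\nAssistant: 
```

This is why the transformers example passes `content` as a list: `apply_chat_template` feeds that structure directly into the template, which the old string-only template could not handle.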