Docling/docling/models
Christoph Auer 7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
* Make page.parsed_page the only source of truth for text cells
* Small fix
* Correctly compute PDF boxes from pymupdf
* Use different OCR engine order
* Add type hints and fix mypy
* One more test fix
* Remove with pypdfium2_lock from caller sites
* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-13 19:01:55 +02:00
factories ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
plugins feat: add factory for ocr engines via plugins (#1010) 2025-03-18 13:58:05 +01:00
utils feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
vlm_models_inline fix: allow custom torch_dtype in vlm models (#1735) 2025-06-10 10:52:15 +02:00
__init__.py Initial commit 2024-07-15 09:42:42 +02:00
api_vlm_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
base_model.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
base_ocr_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
code_formula_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
document_picture_classifier.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
easyocr_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
layout_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
ocr_mac_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
page_assemble_model.py feat: Establish confidence estimation for document and pages (#1313) 2025-05-21 12:32:49 +02:00
page_preprocessing_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
picture_description_api_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
picture_description_base_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
picture_description_vlm_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
rapid_ocr_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
readingorder_model.py fix: prov for merged-elems (#1728) 2025-06-10 11:22:42 +02:00
table_structure_model.py feat: new vlm-models support (#1570) 2025-06-02 17:01:06 +02:00
tesseract_ocr_cli_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
tesseract_ocr_model.py feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00