Docling/tests/data_scanned
Christoph Auer 7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

* Small fix

* Correctly compute PDF boxes from pymupdf

* Use different OCR engine order

* Add type hints and fix mypy

* One more test fix

* Remove with pypdfium2_lock from caller sites

* Fix typing

2025-06-13 19:01:55 +02:00
groundtruth feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) 2025-06-13 19:01:55 +02:00
ocr_test_rotated_90.pdf feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
ocr_test_rotated_180.pdf feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
ocr_test_rotated_270.pdf feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
ocr_test.pdf feat: add options for choosing OCR engines (#118) 2024-10-08 19:07:08 +02:00