feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make page.parsed_page the only source of truth for text cells
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small fix
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Correctly compute PDF boxes from pymupdf
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use different OCR engine order
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add type hints and fix mypy
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* One more test fix
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove with pypdfium2_lock from caller sites
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix typing
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
101474
tests/data/groundtruth/docling_v2/2203.01017v2.pages.json
vendored
File diff suppressed because it is too large
89985
tests/data/groundtruth/docling_v2/2206.01062.pages.json
vendored
File diff suppressed because it is too large
56232
tests/data/groundtruth/docling_v2/2305.03393v1.pages.json
vendored
File diff suppressed because it is too large
9633
tests/data/groundtruth/docling_v2/multi_page.pages.json
vendored
File diff suppressed because it is too large