Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): move bounding bow rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file ( #1582 )
...
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-13 10:40:08 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison ( #1511 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-02 10:52:18 +02:00
Christoph Auer
c93e36988f
feat: Implement new reading-order model ( #916 )
...
* Implement new reading-order model, replacing DS GLM model (WIP)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update reading-order model branch
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add captions, footnotes and merges [skip ci]
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updates for reading-order implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests and lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes, update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add normalization, update tests again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests with code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Push final lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* sanitize text
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Inlcude furniture, Update tests with furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix content_layer assignment
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Delete empty file docling/models/ds_glm_model.py
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-20 17:51:17 +01:00
Christoph Auer
cf78d5b7b9
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
...
* feat: Pass predicted page-headers and page-footers through to DoclingDocument furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update all test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock to final docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-10 12:07:49 +01:00
Christoph Auer
f9144f2bb6
docs: Add example for inspection of picture content ( #624 )
...
* chore: Add example for inspection of picture content
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Test case re-generation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Test case re-generation only on CPU
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Add missing GT files
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-29 10:39:00 +01:00