Docling/tests
Clément Doumouro 45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
data feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
data_scanned feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
__init__.py fix: Add unit tests (#51) 2024-08-30 14:08:20 +02:00
test_backend_asciidoc.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_csv.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_docling_json.py feat: add Docling JSON ingestion (#783) 2025-01-24 18:05:23 +01:00
test_backend_docling_parse_v2.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_docling_parse_v4.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_docling_parse.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_html.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_jats.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_markdown.py fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
test_backend_msexcel.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_msword.py feat: add textbox content extraction in msword_backend (#1538) 2025-05-19 15:01:36 +02:00
test_backend_patent_uspto.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_pdfium.py fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) 2025-05-19 15:26:00 +02:00
test_backend_pptx.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_backend_webp.py feat: support image/webp file type (#1415) 2025-05-14 09:47:28 +02:00
test_cli.py fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
test_code_formula.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_data_gen_flag.py fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
test_document_picture_classifier.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_e2e_conversion.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_e2e_ocr_conversion.py feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
test_input_doc.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_interfaces.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_invalid_input.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_legacy_format_transform.py ci: add coverage and ruff (#1383) 2025-04-14 18:01:26 +02:00
test_options.py feat: Establish confidence estimation for document and pages (#1313) 2025-05-21 12:32:49 +02:00
test_settings_load.py fix(settings): fix nested settings load via environment variables (#1551) 2025-05-14 13:42:10 +02:00
verify_utils.py feat: support image/webp file type (#1415) 2025-05-14 09:47:28 +02:00