Commit Graph

9 Commits

Author SHA1 Message Date
Christoph Auer
7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correctly compute PDF boxes from pymupdf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use different OCR engine order

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add type hints and fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* One more test fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove with pypdfium2_lock from caller sites

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-13 19:01:55 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding bow rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)
get merged_text from boundingbox instead of merging it to prevent overlaps

Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison (#1511)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-02 10:52:18 +02:00
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)
* Add DoclingParseV3 backend implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use docling-core with docling-parse types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes and test updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back DoclingParse v1 backend, pipeline options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update locks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Ground-truth files updated

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests, use TextCell.from_ocr property

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Text fixes, new test data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename docling backend to v4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Test all backends, fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all tests to use docling-parse v1 for now

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for DPv4 backend init, better test coverage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* test_input_doc use default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-18 10:38:19 +01:00
Christoph Auer
c56ab3a66b
fix: Proper handling of orphan IDs in layout postprocessing (#1118)
* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-05 14:30:59 +01:00
Michele Dolfi
9114ada7bc
fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporary constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 08:43:31 +01:00
Michele Dolfi
d01a2e73ee
test: update results with new docling-core (#839)
* test: update results with new docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix table output in 2203.01017v2.md

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-30 14:07:52 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown (#752)
* propagated changes for new CodeItem class

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Rebased branch on latest main. changes for CodeItem

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused files

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* chore: update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* pin latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-core pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use new add_code in backends and update typing in MD backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* added if statement for backend

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed unused import

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* removed print statements

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* gt for new pdf

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* Update docling/pipeline/standard_pdf_pipeline.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>

* fixed doc comment of __call__ function of code_formula_model

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>

* fix artifacts_path type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move expansion_factor to base class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-24 16:54:22 +01:00