9 Commits

Author SHA1 Message Date
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)
* Add DoclingParseV3 backend implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use docling-core with docling-parse types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes and test updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back DoclingParse v1 backend, pipeline options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update locks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Ground-truth files updated

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests, use TextCell.from_ocr property

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Text fixes, new test data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename docling backend to v4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Test all backends, fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all tests to use docling-parse v1 for now

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for DPv4 backend init, better test coverage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* test_input_doc use default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-18 10:38:19 +01:00
Michele Dolfi
8dc0562542
fix: enable locks for threadsafe pdfium (#1052)
* enable locks for threadsafe pdfium

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix deadlock in pypdfium2 backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-02 20:06:44 +01:00
Michele Dolfi
57fc28d3d8
refactor: allow the usage of backends in the enrich models and generalize the interface (#742)
* fix get image with cropbox

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* allow the usage of backends in the enrich models and generalize the interface

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move logic in BaseTextImageEnrichmentModel

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-15 09:52:38 +01:00
Christoph Auer
5a060f237d
fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)
fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-10 10:38:49 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo (#241)
* chore: update pypdfium2_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* chore: update docling_parse_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* chore: update docling_parse_v2_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

---------

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-11-05 16:20:04 +01:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history (#170)
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update imports for docling_parse.pdf_parser_v1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Lock docling-parse 2.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Lock docling-parse 2.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin poetry.lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-23 17:20:11 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 (#117)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend (#131)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 15:12:49 +02:00