* Equation groups
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix: Proper handling of orphan IDs in layout postprocessing (#1118)
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* chore: bump version to 2.25.2 [skip ci]
* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)
add env var in docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix(CLI): fix help message for abort options (#1130)
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* perf: New revision code formula model and document picture classifier (#1140)
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* feat: Use new TableFormer model weights and default to accurate model version (#1100)
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* chore: bump version to 2.26.0 [skip ci]
* fix: Pass tests, update docling-core to 2.22.0 (#1150)
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Updating content hash
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.
Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Support of table of content containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed processing the content of single cell table as if its just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of trickly 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>