* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* build: allow beautifulsoup4 version 4.12.3
Allow the older beautifulsoup4 version 4.12.3 and ensure compatibility with it.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): separate authors and affiliations
In the XML PubMed (JATS) backend, convert authors and affiliations the way
they are typically rendered in PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(xml-jats): replace newline character with a space
Instead of removing newline characters from text, replace each one with a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
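A minimal sketch of why the newline fix above matters, assuming plain Python strings. The function names are hypothetical and are not taken from the backend code:

```python
def strip_newlines(text: str) -> str:
    """Old behaviour (illustrative): drop newlines, which can glue words together."""
    return text.replace("\n", "")


def newlines_to_spaces(text: str) -> str:
    """New behaviour (illustrative): turn each newline into a single space."""
    return text.replace("\n", " ")


sample = "protein\nfolding"
print(strip_newlines(sample))      # proteinfolding
print(newlines_to_spaces(sample))  # protein folding
```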
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, and improve text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* allow the artifacts_path to be defined as ENV
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add check if artifacts_path exists and is dir
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
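The two changes above (reading `artifacts_path` from the environment and checking that it exists and is a directory) could look roughly like the sketch below. The variable name `EXAMPLE_ARTIFACTS_PATH` and the helper `resolve_artifacts_path` are assumptions made for illustration, not the names used in the project:

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical environment variable name, used only for this sketch.
ENV_VAR = "EXAMPLE_ARTIFACTS_PATH"


def resolve_artifacts_path(explicit: Optional[str] = None) -> Optional[Path]:
    """Return a validated artifacts directory, or None if none is configured."""
    # An explicit argument wins; otherwise fall back to the environment.
    raw = explicit or os.environ.get(ENV_VAR)
    if not raw:
        return None
    path = Path(raw).expanduser()
    if not path.exists():
        raise FileNotFoundError(f"artifacts_path does not exist: {path}")
    if not path.is_dir():
        raise NotADirectoryError(f"artifacts_path is not a directory: {path}")
    return path
```

Failing early with a specific exception keeps a misconfigured path from surfacing later as an obscure model-loading error.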
* feat: Pass predicted page-headers and page-footers through to DoclingDocument furniture
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update all test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update all test cases again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock to final docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporarily constrain beautifulsoup
* switch to code formula model v1.0.1 and new test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* cleaned up the data folder in the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* switch to code formula model v1.0.1 and new test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* added three test-files for right-to-left
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix black
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* added new gt for test_e2e_conversion
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* Add code to expose text direction of cell
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* new test file
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix mypy reports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix example filepaths
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add test data results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin wheel of latest docling-parse release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove debugging code
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix path to files in example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Revert unwanted RTL additions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix test data paths in examples
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.
Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
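The message above names `str_to_int` but does not show it. One possible shape for such a helper is sketched below; the signature and the default of `0` are assumptions, not the project's actual implementation:

```python
from typing import Optional


def str_to_int(value: Optional[str], default: int = 0) -> int:
    """Convert value to int, returning default when conversion is not possible."""
    try:
        return int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers non-numeric strings.
        return default


print(str_to_int("12"))   # 12
print(str_to_int("abc"))  # 0
```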
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Processing of placeholder shapes in pptx that have text but no bbox
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Support of table of contents containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for the crash when encountering WMF images in pptx and docx backends on non-Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
The filetype library may not identify some files as PDF. Fall back on the
file extension as a simple solution.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
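A hedged sketch of the extension fallback described above. The `detected` argument stands in for whatever a content-sniffing library returned (`None` when it could not identify the file). The function name is hypothetical:

```python
from pathlib import Path
from typing import Optional


def guess_mime(path: Path, detected: Optional[str]) -> Optional[str]:
    """Prefer the content-based result; use the file extension only as a fallback."""
    if detected is not None:
        return detected
    # Fallback: trust a ".pdf" suffix when content detection gave no answer.
    if path.suffix.lower() == ".pdf":
        return "application/pdf"
    return None


print(guess_mime(Path("report.PDF"), None))  # application/pdf
```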
* Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries
- Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries.
- Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters).
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* style(rapidocr-options): fix alignment of `rec_keys_path` comment
Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made.
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
---------
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* docs: Add example of how to use "auto" language with tesseract OCR engines
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Refactor TesseractOcrModel and TesseractOcrCliModel to validate that the auto-detected
language is installed on the system and, if not, fall back to a default option without a language.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
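The fallback rule in the last fix can be illustrated with the sketch below. It assumes the set of installed languages is already known; `choose_ocr_language` is a hypothetical helper, not part of either Tesseract model class:

```python
from typing import Optional, Set


def choose_ocr_language(detected: Optional[str], installed: Set[str]) -> Optional[str]:
    """Return the detected language if it is installed, else None (no language option)."""
    if detected and detected in installed:
        return detected
    # Fall back to running OCR without an explicit language.
    return None


installed = {"eng", "deu"}
print(choose_ocr_language("deu", installed))  # deu
print(choose_ocr_language("fra", installed))  # None
```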