* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
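A minimal sketch of the kind of guard described above, assuming the image arrives as raw bytes taken from the docx package. The helper name `load_image_or_none` is hypothetical and is not taken from the docling code base; only the Pillow calls are real API.

```python
from io import BytesIO
from typing import Optional

from PIL import Image, UnidentifiedImageError


def load_image_or_none(blob: bytes) -> Optional[Image.Image]:
    """Return a decoded Pillow image, or None if the blob cannot be read.

    Hypothetical helper: formats Pillow cannot decode on the current
    platform (EMF is a common case) are skipped instead of aborting
    the whole conversion.
    """
    try:
        image = Image.open(BytesIO(blob))
        image.load()  # force decoding now so failures surface inside the try
        return image
    except (UnidentifiedImageError, OSError, ValueError):
        # UnidentifiedImageError: Pillow does not recognize the format.
        # OSError / ValueError: recognized but truncated or not decodable here.
        return None
```

A caller can then skip the picture when the helper returns `None` and keep converting the rest of the document.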
* Added test for Word file with embedded EMF images, re-generated full tests for docx, relaxed the lxml dependency
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* feat: add support for `ocrmac` OCR engine on macOS
- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.
This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* updated the poetry lock
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems
- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
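The platform guard mentioned above could look roughly like the sketch below. The function name `require_macos` and the error wording are assumptions for illustration; the actual check in the code base may be structured differently.

```python
import sys


def require_macos(engine: str = "ocrmac", platform: str = sys.platform) -> None:
    """Raise RuntimeError when a macOS-only OCR engine is requested elsewhere.

    Hypothetical helper: `platform` defaults to sys.platform, which is
    "darwin" on macOS, and is a parameter only so the check is easy to test.
    """
    if platform != "darwin":
        raise RuntimeError(
            f"{engine} is only supported on macOS (detected platform: {platform!r})."
        )
```

Calling `require_macos()` before constructing the engine makes the failure explicit instead of surfacing later as an import error.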
* docs: update examples and installation for ocrmac support
- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.
This enhances documentation for users working on macOS to leverage `ocrmac` effectively.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* fix: update `ocrmac` dependency with macOS-specific marker
- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.
This ensures the `ocrmac` dependency is only installed on macOS systems.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
---------
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
Co-authored-by: Suhwan Seo <nuridol@gmail.com>
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* restructure title fix (#187)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* bumped the glm version and adjusted the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix hooks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the tests for tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* fix(tests): Adjust the test data to match the new version of LayoutPredictor from docling-ibm-models
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update poetry to use `docling-ibm-models` at version `v1.2.0`
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: adding txt and doctags output
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaned up the export
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix datamodel usage for Figure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* updated all the examples to deal with new rendering
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* add the pytests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* renamed the test folder and added the toplevel test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the toplevel function test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* need to start running all tests successfully
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the reference converted documents
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added first test for json and md output
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* replaced deprecated json function with model_dump_json
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix backend tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* commented out the drawing
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ci: avoid duplicate runs
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* commented out json verification for now
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added verification of input cells
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformat code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages (3)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* run all examples in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* make sure examples return failures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* raise a failure if examples fail
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
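One way to make example failures visible to CI, sketched under the assumption that each example is a standalone Python script. The names `run_examples` and `exit_code_for` are hypothetical, and the real CI setup may use a shell loop in the workflow instead.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def run_examples(paths: Iterable[Path]) -> List[Path]:
    """Run each script with the current interpreter; return those that failed."""
    failed = []
    for path in paths:
        result = subprocess.run([sys.executable, str(path)], capture_output=True)
        if result.returncode != 0:
            failed.append(path)
    return failed


def exit_code_for(examples_dir: Path) -> int:
    """Return 1 if any *.py script in the directory fails, otherwise 0."""
    failures = run_examples(sorted(examples_dir.glob("*.py")))
    for path in failures:
        print(f"example failed: {path}", file=sys.stderr)
    return 1 if failures else 0
```

A CI step would pass the result to `sys.exit`, for example `sys.exit(exit_code_for(Path("examples")))`, so that a single failing example fails the job.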
* fix examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* run examples after tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add tests and update top_level_tests using only datamodels
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove unnecessary code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Validate conversion status on e2e test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* package verify utils and add more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* reduce docs in example, since they are already in the tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* skip batch_convert
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-parse 1.1.2
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* updated the error messages
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* commented out the json verification for now
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* bumped GLM version
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin new docling-parse v1.1.3
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* repin after more packages on pypi
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
* Add repo, absolute URLs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bump version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>