* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Silence the tqdm messages during the downloading of model files
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Code styling
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Use the HF API to disable the tqdm progress bars
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* updated README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed duck in title
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the index.md
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli to export html
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added html to cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* removed the duck emoji, added the in the cli. Currently, the reference seems broken
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaning up the comments
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reference is now working
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Clean up styling and docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin docling-core>=2.7.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* adding rapidocr engine for ocr in docling
Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
* fixing styling format
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* updating pyproject.toml and poetry.lock to fix ci bugs
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* help poetry pinning for python3.9
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* simplifying rapidocr options so that device can be changed using a single option for all models
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* fix styling issues and small bug in RapidOcrOptions
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* use default device until we enable global management
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add support for `ocrmac` OCR engine on macOS
- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.
This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* updated the poetry lock
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems
- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
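The platform guard described in the commit above could look roughly like the sketch below. It only illustrates the idea of raising a RuntimeError off macOS; the function name `ensure_macos` and the error text are hypothetical and not taken from the docling code base.

```python
import sys


def ensure_macos(platform: str = sys.platform) -> None:
    """Raise RuntimeError unless the platform string identifies macOS.

    Hypothetical helper: the name and message are illustrative only.
    """
    # sys.platform is "darwin" on macOS; every other value is rejected.
    if platform != "darwin":
        raise RuntimeError(
            f"The ocrmac engine is only available on macOS (found platform {platform!r})."
        )
```

Taking the platform as a parameter keeps the check testable on any machine; a caller would normally rely on the default.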
* docs: update examples and installation for ocrmac support
- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.
This enhances documentation for users working on macOS to leverage `ocrmac` effectively.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* fix: update `ocrmac` dependency with macOS-specific marker
- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.
This ensures the `ocrmac` dependency is only installed on macOS systems.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
---------
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
Co-authored-by: Suhwan Seo <nuridol@gmail.com>
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
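A minimal sketch of the "force OCR" rule from the notes above, assuming a simplified cell model. `Cell`, `select_cells`, and the non-forced branch (simply keeping both sets of cells) are hypothetical stand-ins, not the actual docling pipeline logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical text cell; from_ocr marks cells produced by the OCR engine."""
    text: str
    from_ocr: bool


def select_cells(pdf_cells: List[Cell], ocr_cells: List[Cell], force_ocr: bool) -> List[Cell]:
    """Return the cells to keep for a page."""
    if force_ocr:
        # Forced OCR: any existing PDF cells are rejected, only OCR output is kept.
        return list(ocr_cells)
    # Assumption for this sketch: without forcing, both sources are kept.
    return list(pdf_cells) + list(ocr_cells)
```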
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Refactor and fix profiling code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Visualization code outputs PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: set valid=false for invalid backends
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add test case for InputDocument
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* updated the render_as_doctags with the new arguments from docling-core
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the doctags tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the README
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix poetry lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Fix formatting problems
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fixed the doctag export in docling/utils/export.py
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* propagate xsize and ysize
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* feat: adding txt and doctags output
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* cleaned up the export
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix datamodel usage for Figure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* updated all the examples to deal with new rendering
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* add the pytests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* renamed the test folder and added the toplevel test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the toplevel function test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* need to start running all tests successfully
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the reference converted documents
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added first test for json and md output
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran pre-commit
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* replaced deprecated json function with model_dump_json
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix backend tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* commented out the drawing
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ci: avoid duplicate runs
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* commented out json verification for now
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added verification of input cells
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformat code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added test to verify the cells in the pages (3)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* run all examples in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* make sure examples return failures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* raise a failure if examples fail
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* run examples after tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add tests and update top_level_tests using only datamodels
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove unnecessary code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Validate conversion status on e2e test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* package verify utils and add more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* reduce docs in example, since they are already in the tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* skip batch_convert
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-parse 1.1.2
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* updated the error messages
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* commented out the json verification for now
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* bumped GLM version
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Fix lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pin new docling-parse v1.1.3
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
- fields `output` & `assembled` need not be optional
- introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
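One common way to introduce a new name while deprecating the old one is sketched below. This is an assumption about the general pattern, not the actual docling implementation: the `status` field is a placeholder, and a subclass that warns on construction is only one of several ways to keep the old name working.

```python
import warnings


class ConversionResult:
    """New preferred name; the single field here is a placeholder."""

    def __init__(self, status: str = "success") -> None:
        self.status = status


class ConvertedDocument(ConversionResult):
    """Old name kept for compatibility; warns whenever it is instantiated."""

    def __init__(self, *args, **kwargs) -> None:
        warnings.warn(
            "ConvertedDocument is deprecated; use ConversionResult instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this class
        )
        super().__init__(*args, **kwargs)
```

Existing code that constructs `ConvertedDocument` keeps working and still passes `isinstance(..., ConversionResult)` checks, while emitting a DeprecationWarning that nudges callers toward the new name.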
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* repin after more packages on pypi
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* Add assemble options and example saving pages and figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add options for different page elements, improve example and flip name of assemble_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>