* Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries
- Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries.
- Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters).
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
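A minimal sketch of how an optional `rec_keys_path` could travel from an options object into the keyword arguments handed to a recognition model. `RapidOcrOptionsSketch`, `build_reader_kwargs`, and the sibling field `rec_model_path` are hypothetical names used for illustration only; this is not the code from this change.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RapidOcrOptionsSketch:
    """Hypothetical stand-in for the options object (not the real class)."""

    rec_model_path: Optional[str] = None  # assumed sibling option, illustration only
    rec_keys_path: Optional[str] = None   # path to a custom character dictionary


def build_reader_kwargs(options: RapidOcrOptionsSketch) -> Dict[str, str]:
    """Forward only the paths that were set, so unset ones fall back to engine defaults."""
    candidates = {
        "rec_model_path": options.rec_model_path,
        "rec_keys_path": options.rec_keys_path,
    }
    return {name: value for name, value in candidates.items() if value is not None}
```

With only `rec_keys_path` set, the resulting dictionary carries just that one key, which is the behaviour the commit body describes for loading a Latin dictionary.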
* style(rapidocr-options): fix alignment of `rec_keys_path` comment
Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made.
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
---------
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierarchical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with a function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write a new example showing how to use the accelerator options.
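The AUTO resolution described in the list above can be sketched roughly as follows. The `Device` enum, `resolve_device`, `device_from_env`, the environment variable name `SKETCH_DEVICE`, and the CUDA-before-MPS preference order are all assumptions made for this example; they are not the names or the policy introduced by this commit.

```python
from enum import Enum
from typing import Callable, Mapping


class Device(str, Enum):
    """Hypothetical device enum, loosely mirroring the idea of AcceleratorDevice."""

    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_device(requested: Device, is_available: Callable[[Device], bool]) -> Device:
    """Turn AUTO into a concrete device; fall back to CPU if the request is unavailable."""
    if requested is Device.AUTO:
        # Assumed preference order for this sketch: CUDA, then MPS, then CPU.
        for candidate in (Device.CUDA, Device.MPS):
            if is_available(candidate):
                return candidate
        return Device.CPU
    return requested if is_available(requested) else Device.CPU


def device_from_env(env: Mapping[str, str], var: str = "SKETCH_DEVICE") -> Device:
    """Read the requested device from an environment mapping, defaulting to AUTO."""
    return Device(env.get(var, "auto").lower())
```

Passing the availability check in as a callable keeps the sketch independent of any specific backend library and makes the fallback logic easy to unit test.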
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Ocr AcceleratorDevice
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update HF model ref, reset test generate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Repin to release package versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Many layout processing improvements, add document index type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update pinnings to docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix table box snapping
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for cluster pre-ordering
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce OCR confidence, propagate to orphan in post-processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix form and key value area groups
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust confidence in EasyOcr
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Roll back CLI changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-core pinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Annoying fixes for historical python versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test GT for legacy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Comment cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
* Update easyocr_model.py
Added this line to read the `recog_network` parameter from the EasyOCR options:
recog_network = self.options.recog_network
Signed-off-by: itsainii <aininawawii@gmail.com>
* Update pipeline_options.py
Added this line to the `EasyOcrOptions` class:
recog_network: Optional[str] = 'standard'
Signed-off-by: itsainii <aininawawii@gmail.com>
* Add Easyocr recog_network parameter
Signed-off-by: itsainii <aininawawii@gmail.com>
---------
Signed-off-by: itsainii <aininawawii@gmail.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierarchical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with a function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write a new example showing how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Rollback changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test gt
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove unused debug settings
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Review fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Nail the accelerator defaults for MPS
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com>
Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source

Arguments:
  * input_sources  source
      PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]

Options:
  --from  [docx|pptx|html|image|pdf|asciidoc|md|xlsx]
      Specify input formats to convert from. Defaults to all formats. [default: None]
  --to  [md|json|html|text|doctags]
      Specify output formats. Defaults to Markdown. [default: None]
  --image-export-mode  [placeholder|embedded|referenced]
      Image export mode for the document (only in case of JSON, Markdown or HTML).
      With `placeholder`, only the position of the image is marked in the output.
      In `embedded` mode, the image is embedded as base64 encoded string.
      In `referenced` mode, the image is exported in PNG format and referenced from the main exported document.
      [default: embedded]
  --ocr  --no-ocr
      If enabled, the bitmap content will be processed using OCR. [default: ocr]
  --force-ocr  --no-force-ocr
      Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]
  --ocr-engine  [easyocr|tesseract_cli|tesseract|ocrmac|rapidocr]
      The OCR engine to use. [default: easyocr]
  --ocr-lang  TEXT
      Provide a comma-separated list of languages used by the OCR engine.
      Note that each OCR engine has different values for the language names. [default: None]
  --pdf-backend  [pypdfium2|dlparse_v1|dlparse_v2]
      The PDF backend to use. [default: dlparse_v2]
  --table-mode  [fast|accurate]
      The mode to use in the table structure model. [default: fast]
  --artifacts-path  PATH
      If provided, the location of the model artifacts. [default: None]
  --abort-on-error  --no-abort-on-error
      If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]
  --output  PATH
      Output directory where results are saved. [default: .]
  --verbose  -v  INTEGER
      Set the verbosity level. -v for info logging, -vv for debug logging. [default: 0]
  --debug-visualize-cells  --no-debug-visualize-cells
      Enable debug output which visualizes the PDF cells [default: no-debug-visualize-cells]
  --debug-visualize-ocr  --no-debug-visualize-ocr
      Enable debug output which visualizes the OCR cells [default: no-debug-visualize-ocr]
  --debug-visualize-layout  --no-debug-visualize-layout
      Enable debug output which visualizes the layour clusters [default: no-debug-visualize-layout]
  --debug-visualize-tables  --no-debug-visualize-tables
      Enable debug output which visualizes the table cells [default: no-debug-visualize-tables]
  --version
      Show version information.
  --document-timeout  FLOAT
      The timeout for processing each document, in seconds. [default: None]
  --help
      Show this message and exit.
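In the `--document-timeout=10` run logged above, the reported processing time (24.555 seconds) is well past the limit, which suggests the timeout is checked cooperatively between units of work rather than interrupting work in progress. A minimal sketch of that idea, assuming a per-page check; `run_with_timeout`, `process_page`, and the injectable `clock` are hypothetical names and not taken from the pipeline code:

```python
import time
from typing import Callable, Iterable, Optional


def run_with_timeout(
    pages: Iterable[int],
    process_page: Callable[[int], None],
    timeout: Optional[float],
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Process pages in order; return False once the elapsed time exceeds the timeout.

    The check runs only after a page finishes, so the total time can overshoot
    the limit. A timeout of None disables the check.
    """
    start = clock()
    for page in pages:
        process_page(page)
        if timeout is not None and clock() - start > timeout:
            return False  # caller would mark the document as failed
    return True
```

A `False` result corresponds to the "failed to convert" outcome in the first run, while `timeout=None` matches the run without the flag.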
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Silence the tqdm messages during the downloading of model files
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Code styling
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Use the HF API to disable the tqdm progress bars
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* adding rapidocr engine for ocr in docling
Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
* fixing styling format
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* updating pyproject.toml and poetry.lock to fix ci bugs
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* help poetry pinning for python3.9
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* simplifying rapidocr options so that device can be changed using a single option for all models
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* fix styling issues and small bug in rapidOcrOptions
Signed-off-by: Swaymaw <swaymaw@gmail.com>
* use default device until we enable global management
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add support for `ocrmac` OCR engine on macOS
- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.
This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* updated the poetry lock
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems
- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
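The platform guard mentioned in this commit could look roughly like the sketch below. `ensure_macos` is a hypothetical helper and the error message wording is assumed; only the use of `RuntimeError` on non-Mac platforms comes from the commit text.

```python
import sys


def ensure_macos(engine_name: str, platform: str = sys.platform) -> None:
    """Raise RuntimeError when a macOS-only OCR engine is requested elsewhere."""
    # sys.platform is "darwin" on macOS.
    if platform != "darwin":
        raise RuntimeError(
            f"{engine_name} is only supported on macOS (detected platform: {platform})."
        )
```

Accepting the platform as a parameter keeps the check testable on any operating system.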
* docs: update examples and installation for ocrmac support
- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.
This enhances documentation for users working on macOS to leverage `ocrmac` effectively.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
* fix: update `ocrmac` dependency with macOS-specific marker
- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.
This ensures the `ocrmac` dependency is only installed on macOS systems.
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
---------
Signed-off-by: Suhwan Seo <nuridol@gmail.com>
Co-authored-by: Suhwan Seo <nuridol@gmail.com>
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
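A minimal sketch of the forced-OCR rule described above, assuming cells are simple records and that, without forcing, OCR cells are added alongside the existing PDF cells. `Cell` and `select_cells` are hypothetical names, and the non-forced branch is a simplification rather than the pipeline's actual merge logic.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Cell:
    """Hypothetical text cell; from_ocr marks cells produced by the OCR engine."""

    text: str
    from_ocr: bool


def select_cells(
    pdf_cells: Sequence[Cell], ocr_cells: Sequence[Cell], force_ocr: bool
) -> List[Cell]:
    """With force_ocr, reject every existing PDF cell and keep only OCR output."""
    if force_ocr:
        return list(ocr_cells)
    # Simplified default for this sketch: keep PDF cells and append OCR cells.
    return list(pdf_cells) + list(ocr_cells)
```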
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>