Commit Graph

228 Commits

Author SHA1 Message Date
Panos Vagenas
84c46fdeb3
docs: extend integration docs & README (#456)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-28 09:41:21 +01:00
github-actions[bot]
211f4f7570 chore: bump version to 2.8.0 [skip ci] 2024-11-27 13:29:32 +00:00
Swaymaw
85b29990be
feat(ocr): added support for RapidOCR engine (#415)
* adding rapidocr engine for ocr in docling

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>

* fixing styling format

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* updating pyproject.toml and poetry.lock to fix ci bugs

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* help poetry pinning for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplifying rapidocr options so that device can be changed using a single option for all models

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* fix styling issues and small bug in rapidOcrOptions

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* use default device until we enable global management

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-27 13:57:41 +01:00
Manuel030
767563bf8b
fix: use correct image index in word backend (#442)
* fix image index in word backend

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* fix: Fixes for wordx (#432)

* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* sign dco

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* correct rebase error

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

---------

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-27 13:45:07 +01:00
Christoph Auer
29807a2d68
fix: Update tests and examples for docling-core 2.5.1 (#449)
* Update tests for docling-core 2.5.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add export with referenced images to export_figures example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Fix OCR tests"

This reverts commit 12b575946f51950fcacece99d4d6eb682125d779.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile for docling-core 2.5.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:07:00 +01:00
github-actions[bot]
6666d9ec07 chore: bump version to 2.7.1 [skip ci] 2024-11-26 15:01:33 +00:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Michele Dolfi
d7072b4b56
fix: force pydantic < 2.10.0 (#407)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-22 08:23:11 +01:00
Peter W. J. Staar
2a1d3fd221
chore: update the README (#409)
* chore: update the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update README.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* chore: update the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:28:53 +01:00
Panos Vagenas
7a45b92078
docs: add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:23:04 +01:00
Michele Dolfi
97d571af97
chore: add downloads in README, security policy and update ci actions (#401)
* add security policy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deprecated actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add comment about licenses for new dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pypi downloads badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add citation file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-21 13:59:45 +01:00
github-actions[bot]
eb64f6d368 chore: bump version to 2.7.0 [skip ci] 2024-11-20 15:36:51 +00:00
Michele Dolfi
7b013abcf3
fix: python3.9 support (#396)
* fixes for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse with python3.9 wheels

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 15:21:40 +01:00
nuridol
6efa96c983
feat: add support for ocrmac OCR engine on macOS (#276)
* feat: add support for `ocrmac` OCR engine on macOS

- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.

This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* updated the poetry lock

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems

- Resolved formatting and linting issues
- Updated `--ocr-engine` CLI option documentation for `ocrmac`
- Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* feat: add support for `ocrmac` OCR engine on macOS

- Integrates `ocrmac` as an OCR engine option for macOS users.
- Adds configuration options and dependencies for `ocrmac`.
- Updates documentation to reflect new engine support.

This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* docs: update examples and installation for ocrmac support

- Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples.
- Included usage comments and examples for `OcrMacOptions` in OCR pipelines.
- Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+).
- Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend.

This enhances documentation for users working on macOS to leverage `ocrmac` effectively.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

* fix: update `ocrmac` dependency with macOS-specific marker

- Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility.
- Updated the content hash in `poetry.lock` to reflect the changes.

This ensures the `ocrmac` dependency is only installed on macOS systems.

Signed-off-by: Suhwan Seo <nuridol@gmail.com>

---------

Signed-off-by: Suhwan Seo <nuridol@gmail.com>
Co-authored-by: Suhwan Seo <nuridol@gmail.com>
2024-11-20 12:51:19 +01:00
Michele Dolfi
32ebf55e33
fix: propagate document limits to converter (#388)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 08:36:51 +01:00
github-actions[bot]
2cfaceb787 chore: bump version to 2.6.0 [skip ci] 2024-11-19 16:07:34 +00:00
Shubham Gupta
3f91e7d3f1
feat: added support for exporting DocItem to an image when page image is available (#379)
* Updated minimum docling-core version to 2.4.0

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

* Deprecated the generate_table_images option

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

* Updated examples to use get_image instead of element.image

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>

---------

Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>
2024-11-19 16:28:52 +01:00
Gaspard Petit
911c3bda27
docs: fixed typo in v2 example v2 (#378)
Update v2.md - fixed typo in example: iterate_items -> iterate_items()

Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
2024-11-19 16:27:19 +01:00
Michele Dolfi
ed785ea122
feat: expose ocr-lang in CLI (#375)
* feat: expose ocr-lang in CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use regex for supporting multiple sep

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-19 15:58:49 +01:00
Peter W. J. Staar
926dfd29d5
feat: added excel backend (#334)
* feat: added excel backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first msexcel backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tooling for the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working version for excel parsing of tables

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added proper typing for mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added proper typing for mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactor EXCEL to XLSX

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the unit tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran poetry lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding images to output [WIP]

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the msexcel

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the msexcel (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added tests for merged cells in excel

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-11-19 12:21:17 +01:00
Michele Dolfi
e6f89d520f
chore: update lock of deps (#371)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-19 10:23:59 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX (#349)
* Added picture data for pptx pictures

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added tests for pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Inferring image DPI from pptx file

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-18 15:22:28 +01:00
Michele Dolfi
7dbdbdeaf3
ci: fix mergify (#350)
* no conv commit message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mergify rules

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 17:13:01 +01:00
Michele Dolfi
364d37ca96
ci(Mergify): configuration update (#339)
* ci(Mergify): configuration update

Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove conventionalcommits from the checklist

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:33 +01:00
Michele Dolfi
ca8524ecae
docs: add automatic generation of CLI reference (#325)
* docs: add automatic generation of CLI reference

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* install deps for building CLI ref

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:17 +01:00
Panos Vagenas
25fd149c38
docs: add architecture outline (#341)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-15 12:52:41 +01:00
Carl
835e077b02
docs: fix parameter in usage.md (#332)
Signed-off-by: Carl Senze <carl.senze@aleph-alpha.com>
Co-authored-by: Carl Senze <carl.senze@aleph-alpha.com>
2024-11-15 09:24:15 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files (#330)
* Fixing images identification in the input Word files

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Populating extracted image data into docling picture for wordx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed base64 dependency in msword_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-14 13:33:34 +01:00
Panos Vagenas
bf2a85f1d4
chore: fix Qdrant notebook Colab link (#319)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-14 10:42:02 +01:00
Michele Dolfi
8b437adcde
fix: reduce logging by keeping option for more verbose (#323)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 10:08:24 +01:00
github-actions[bot]
5a44236ac2 chore: bump version to 2.5.2 [skip ci] 2024-11-13 08:19:09 +00:00
Michele Dolfi
c9341bf22e
fix: skip glm model downloads (#322)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 08:45:28 +01:00
github-actions[bot]
2c0c439a44 chore: bump version to 2.5.1 [skip ci] 2024-11-12 14:56:34 +00:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend (#314)
* Handling of single-cell tables in DOCX backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* returned try-catch on tables handling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* proceed processing the content of single cell table as if its just part of the body

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added example of trickly 1 cell table docx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Anush
7f5d35ea3c
docs: Hybrid RAG with Qdrant (#312)
Signed-off-by: Anush008 <anushshetty90@gmail.com>
2024-11-12 15:18:14 +01:00
Panos Vagenas
93fc1be61a
docs: add Data Prep Kit integration (#316)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-12 12:21:48 +01:00
github-actions[bot]
777237ebc9 chore: bump version to 2.5.0 [skip ci] 2024-11-12 10:19:55 +00:00
Christoph Auer
5d4a10b121
fix: Configure env prefix for docling settings (#315)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-12 10:57:16 +01:00
Nikos Livathinos
c6b3763ecb
feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290)
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-12 09:46:14 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend (#307)
* Added handling of grouped elements in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* updated log.warn to warning

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag (#302)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 15:00:11 +01:00
Panos Vagenas
1239ade275
docs: add navigation indices (#305)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-11 14:49:06 +01:00
Michele Dolfi
97f214efdd
fix: allow mps usage for easyocr (#286)
* fix: allow mps usage for easyocr

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add example for cpu-only

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* comment out example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-10 14:26:17 +01:00
github-actions[bot]
be8aa17291 chore: bump version to 2.4.2 [skip ci] 2024-11-08 16:31:47 +00:00
Nikos Livathinos
0eb065e9b6
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282)
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 16:48:41 +01:00
github-actions[bot]
118f162e64 chore: bump version to 2.4.1 [skip ci] 2024-11-08 12:37:36 +00:00
Nikos Livathinos
704d792a79
fix(tesserocr): Raise Exception if tesserocr has not loaded any languages (#279)
fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 13:03:09 +01:00
Panos Vagenas
6c22cba0a7
chore: add issue templates (#251)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo (#241)
* chore: update pypdfium2_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* chore: update docling_parse_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* chore: update docling_parse_v2_backend.py

occured -> occurred

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

---------

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-11-05 16:20:04 +01:00
Panos Vagenas
a84ec276b0
docs: update badges & credits (#248)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00