Commit Graph

217 Commits

Author SHA1 Message Date
Marco Fargetta
4046d0b2f3 fix: AsciiDoc header identification (#1562) (#1563)
Fix regular expression to identify header lines in AsciiDoc avoiding to
match defined blocks.

Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
Michele Dolfi
127e38646f fix: add smoldocling in download utils (#1577)
add smoldocling in download utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-12 10:48:07 +02:00
Cesar Berrospi Ramis
776e7ecf9a fix(HTML): handle row spans in header rows (#1536)
* chore(HTML): log the stacktrace of errors

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(HTML): handle row headers like in pivot tables

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-09 15:14:32 +02:00
DavidLee
f1658edbad fix: mime error in document streams (#1523)
Update document.py

edit got file mime error

Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9 fix: usage of hashlib for FIPS (#1512)
fix usage of hashlib for FIPS

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-02 15:03:29 +02:00
Ihar Hrachyshka
b147331f2a chore: restore typing hint for self.script_readers (#1500)
With future annotations, typing hints resolution is always deferred.

https://peps.python.org/pep-0563/

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-30 20:33:27 +02:00
Ben Browning
4ab7e9ddfb fix: Guard against attribute errors in TesseractOcrModel __del__ (#1494)
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.

This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.

This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-30 17:51:33 +02:00
Zach Cox
cc453961a9 fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496)
fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel

Signed-off-by: Zach Cox <zach.s.cox@gmail.com>
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289 fix: updated the time-recorder label for reading order (#1490)
* fix: updated the time-recorder label for reading order

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-04-29 13:02:53 +02:00
nkh0472
a097ccd8d5 chore: typo fix (#1465)
* typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

---------

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
2025-04-28 08:52:09 +02:00
Maxim Lysak
94d66a0765 fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459)
fixing double scaling in case of do_cell_matching is False

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-25 12:34:12 +02:00
Cesar Berrospi Ramis
ed20124544 fix(html): handle address, details, and summary tags (#1436)
* fix(html): handle 'address' tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(html): handle 'details' tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-23 09:30:59 +02:00
Eugene
8012a3e4d6 fix: Treat overflowing -v flags as DEBUG (#1419)
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-19 11:02:41 +02:00
Michele Dolfi
5458a88464 ci: add coverage and ruff (#1383)
* add coverage calculation and push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* new codecov version and usage of token

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enable ruff formatter instead of black and isort

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff lint fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff unsafe fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add removed imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* runs 1 on linter issues

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* finalize linter fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update pyproject.toml

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-14 18:01:26 +02:00
Peter W. J. Staar
c0ba88edf1 feat(cli): add option for html with split-page mode (#1355)
* updated the cli to output html in split-page mode

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* add pin for new docling-core with html split argument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* relock with fixed html export in docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update more tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock with docling-core fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add again chunking extras

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991 fix: auto-recognize .xlsx, .docx and .pptx files (#1340)
* bug: auto-recognize .xlsx files

Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com>

* apply styling

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply to other ms office zip formats

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 07:45:13 +02:00
Cesar Berrospi Ramis
415b877984 fix(docx): declare image_data variable when handling pictures (#1359)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248)
fix: Implement PictureDescriptionApiOptions.picture_area_threshold

Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com>
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77 feat(xlsx): create a page for each worksheet in XLSX backend (#1332)
* sytle(xlsx): enforce type hints in XLSX backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xlsx): create a page for each worksheet in XLSX backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docs(xlsx): add docstrings to XLSX backend module.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docling(xlsx): add bounding boxes and page size information in cell units

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9 feat: OllamaVlmModel for Granite Vision 3.2 (#1337)
* build: Add ollama sdk dependency

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add option plumbing for OllamaVlmOptions in pipeline_options

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Full implementation of OllamaVlmModel

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Connect "granite_vision_ollama" pipeline option to CLI

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "build: Add ollama sdk dependency"

After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.

This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move OpenAI API call logic into utils.utils

This will allow reuse of this logic in a generic VLM model

NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor from Ollama SDK to generic OpenAI API

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Linting, formatting, and bug fixes

The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* remove model from download enum

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* generalize input args for other API providers

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename and refactor

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* require flag for remote services

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* disable example from CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add examples to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a fix: Properly address page in pipeline _assemble_document when page_range is provided (#1334)
* Fixes #1333

Signed-off-by: Joan Fabrégat <j@fabreg.at>

* fix for the (dumb) MyPy type checker

Signed-off-by: Joan Fabrégat <j@fabreg.at>

---------

Signed-off-by: Joan Fabrégat <j@fabreg.at>
2025-04-10 16:11:28 +02:00
Maxim Lysak
355d8dc7a6 chore: Logo parameter in docling CLI, prints cute ascii logo (#1294)
logo parameter in docling cli, prints cute ascii logo

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima
14e9c0ce9a fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295)
* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(docx): Improve text parsing (#1268)

* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add visual grounding example (#1270)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat(docx): add text formatting and hyperlink support (#630)

* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(pptx): check if picture shape has an image attached (#1316)

Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: update lock file (#1315)

chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add plugins docs (#1319)

add plugin docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: handle <code> tags as code blocks (#1320)

handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com>
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e feat: handle <code> tags as code blocks (#1320)
handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
2025-04-08 10:32:06 +02:00
Maxim Lysak
dc3bf9ceac fix(pptx): check if picture shape has an image attached (#1316)
Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677 feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
d2d68747f9 fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)
fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
2025-03-31 10:53:49 +02:00
Maxim Lysak
7afad7e52d fix: Fixes tables when using OCR (#1261)
Fix for the tables when using OCR

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-29 10:06:00 +01:00
Maxim Lysak
8bd71e8e33 fix: Word-level pdf cells for tables (#1238)
* word-level pdf cells for tables

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed comments

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated dependency to docling-core

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-28 16:34:48 +01:00
Panos Vagenas
9210812bfa fix: improve HTML layer detection, various MD fixes (#1241)
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b fix(html): fix HTML parsed heading level (#1244)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-26 10:30:23 +01:00
mislavmartinic
825b226fab fix(converter): Cache same pipeline class with different options (#1152)
* Update document_converter.py

Fixing caching same class with different options by using composite key (class, options)

# TODO this will ignore if different options have been defined for the same pipeline class.

at row 292

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>

* formatted script

* removed unnecessary hasattr check

* pre-commit chain run

---------

Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>
2025-03-25 12:18:44 +01:00
Hoang-Long Do
6df8827231 fix(debug): Missing translation of bbox to to_bounding_box (#1220)
* Fix: Add missing bbox attribute to PdfTextCell

* Fix: Add missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* fix: Refactor missing bbox attribute to PdfTextCell

Signed-off-by: hl2311 <dhlong2301@gmail.com>

* Signed-off-by: hl2311 <dhlong2301@gmail.com>

fix: Refactor missing bbox attribute to PdfTextCell

---------

Signed-off-by: hl2311 <dhlong2301@gmail.com>
2025-03-25 12:18:10 +01:00
Rafael Teixeira de Lima
f739d0e4c5 fix(docx): identifying numbered headers (#1231)
* Modifications to identify numbered headers

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Add style check

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-03-25 11:41:02 +01:00
Maxim Lysak
1c26769785 feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199)
* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated minimal vlm pipeline example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* make vlm_pipeline python3.9 compatible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed extract_text_from_backend definition

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated README

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* corrections in the documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Consmetic changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-19 15:38:54 +01:00
Maciej Wieczorek
b454aa1551 feat: Add PPTX notes slides (#474)
* feat: Add PPTX notes slides

Presenter notes may have useful information and should also be extracted.

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

* feat: Move presenter notes into furniture

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

---------

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724 fix: Determine correct page size in DoclingParseV4Backend (#1196)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-19 11:05:42 +01:00
Rafael Teixeira de Lima
0b707d0882 fix(msword): Fixing function return in equations handling (#1194)
* Fixing function return

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Add message

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-03-19 10:34:25 +01:00
Maxim Lysak
2f72167ff6 feat: updated vlm pipeline (with latest changes from docling-core) (#1158)
* Draft implementation of Doctag backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated VLM pipeline doctags to docling conversion, now properly supports lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* preparing to migrate to new doctags deserializer

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* re-using DocTagsDocument.from_doctags_and_image_pairs

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* satisfying mypy and other checks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added support for force_backend_text parameter

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed unnecessary transformation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated readme

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-18 15:44:51 +01:00
Michele Dolfi
6eaae3cba0 feat: add factory for ocr engines via plugins (#1010)
* add factory for ocr engines

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply pre-commit after rebase

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add picture description factory

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix enable option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* switch to create methods

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make `options` an explicit kwarg

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* keep old lock of docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add allow_external_plugins option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add factory return and ignore options type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-18 13:58:05 +01:00
Christoph Auer
3960b199d6 feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)
* Add DoclingParseV3 backend implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use docling-core with docling-parse types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes and test updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back DoclingParse v1 backend, pipeline options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update locks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Ground-truth files updated

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests, use TextCell.from_ocr property

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Text fixes, new test data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename docling backend to v4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Test all backends, fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all tests to use docling-parse v1 for now

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for DPv4 backend init, better test coverage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* test_input_doc use default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-18 10:38:19 +01:00
Michele Dolfi
fa16b12316 chore: move to docling-project org (#1160)
* chore: rename org

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update docs/faq/index.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* update github pages

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* revert test content

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-03-14 12:35:29 +01:00
Cesar Berrospi Ramis
f94da44ec5 fix(html): handle nested empty lists (#1154)
Address the case of nested lists in empty list items.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 16:56:58 +01:00
Panos Vagenas
0945973b79 fix: use first table row as col headers (#1156)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima
6eb718f849 feat: equations to latex in MSWord backend (with inline groups) (#1114)
* Equation groups

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Proper handling of orphan IDs in layout postprocessing (#1118)

* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.25.2 [skip ci]

* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)

add env var in docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(CLI): fix help message for abort options (#1130)

fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* perf: New revision code formula model and document picture classifier (#1140)

* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.26.0 [skip ci]

* fix: Pass tests, update docling-core to 2.22.0 (#1150)

fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Updating content hash

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
aa92a57fa9 fix: Pass tests, update docling-core to 2.22.0 (#1150)
fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 09:45:55 +01:00
Christoph Auer
eb97357b05 feat: Use new TableFormer model weights and default to accurate model version (#1100)
* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-11 10:53:49 +01:00
Matteo
5e30381c0d perf: New revision code formula model and document picture classifier (#1140)
* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
2025-03-11 10:15:28 +01:00
Michele Dolfi
4d64c4c0b6 fix(CLI): fix help message for abort options (#1130)
fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-07 14:47:49 +01:00