Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS ( #1512 )
...
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-02 15:03:29 +02:00
Ihar Hrachyshka
b147331f2a
chore: restore typing hint for self.script_readers ( #1500 )
...
With future annotations, typing hints resolution is always deferred.
https://peps.python.org/pep-0563/
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com >
2025-04-30 20:33:27 +02:00
Ben Browning
4ab7e9ddfb
fix: Guard against attribute errors in TesseractOcrModel __del__ ( #1494 )
...
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.
This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.
This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-04-30 17:51:33 +02:00
Zach Cox
cc453961a9
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel ( #1496 )
...
fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel
Signed-off-by: Zach Cox <zach.s.cox@gmail.com >
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289
fix: updated the time-recorder label for reading order ( #1490 )
...
* fix: updated the time-recorder label for reading order
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-04-29 13:02:53 +02:00
nkh0472
a097ccd8d5
chore: typo fix ( #1465 )
...
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-28 08:52:09 +02:00
Maxim Lysak
94d66a0765
fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False ( #1459 )
...
fixing double scaling in case of do_cell_matching is False
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-25 12:34:12 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Eugene
8012a3e4d6
fix: Treat overflowing -v flags as DEBUG ( #1419 )
...
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-04-19 11:02:41 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode ( #1355 )
...
* updated the cli to output html in split-page mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* add pin for new docling-core with html split argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* relock with fixed html export in docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update lock with docling-core fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add again chunking extras
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991
fix: auto-recognize .xlsx, .docx and .pptx files ( #1340 )
...
* bug: auto-recognize .xlsx files
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
* apply styling
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply to other ms office zip formats
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 07:45:13 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d
fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold ( #1248 )
...
fix: Implement PictureDescriptionApiOptions.picture_area_threshold
Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com >
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* sytle(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9
feat: OllamaVlmModel for Granite Vision 3.2 ( #1337 )
...
* build: Add ollama sdk dependency
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add option plumbing for OllamaVlmOptions in pipeline_options
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Full implementation of OllamaVlmModel
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Connect "granite_vision_ollama" pipeline option to CLI
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* Revert "build: Add ollama sdk dependency"
After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.
This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Move OpenAI API call logic into utils.utils
This will allow reuse of this logic in a generic VLM model
NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Refactor from Ollama SDK to generic OpenAI API
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Linting, formatting, and bug fixes
The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* remove model from download enum
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* generalize input args for other API providers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename and refactor
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* require flag for remote services
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* disable example from CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add examples to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a
fix: Properly address page in pipeline _assemble_document when page_range is provided ( #1334 )
...
* Fixes #1333
Signed-off-by: Joan Fabrégat <j@fabreg.at >
* fix for the (dumb) MyPy type checker
Signed-off-by: Joan Fabrégat <j@fabreg.at >
---------
Signed-off-by: Joan Fabrégat <j@fabreg.at >
2025-04-10 16:11:28 +02:00
Maxim Lysak
355d8dc7a6
chore: Logo parameter in docling CLI, prints cute ascii logo ( #1294 )
...
logo parameter in docling cli, prints cute ascii logo
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached ( #1316 )
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing ( #1268 )
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-04-02 12:56:44 +02:00
Guilhem VERMOREL
b3d111a3cd
fix: Tesseract OCR CLI can't process images composed with numbers only ( #1201 )
...
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-03-31 10:53:49 +02:00
Maxim Lysak
7afad7e52d
fix: Fixes tables when using OCR ( #1261 )
...
Fix for the tables when using OCR
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-03-29 10:06:00 +01:00
Maxim Lysak
8bd71e8e33
fix: Word-level pdf cells for tables ( #1238 )
...
* word-level pdf cells for tables
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed comments
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated dependency to docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-03-28 16:34:48 +01:00
Panos Vagenas
9210812bfa
fix: improve HTML layer detection, various MD fixes ( #1241 )
...
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b
fix(html): fix HTML parsed heading level ( #1244 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 10:30:23 +01:00
mislavmartinic
825b226fab
fix(converter): Cache same pipeline class with different options ( #1152 )
...
* Update document_converter.py
Fixing caching same class with different options by using composite key (class, options)
# TODO this will ignore if different options have been defined for the same pipeline class.
at row 292
Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com >
* formatted script
* removed unnecessary hasattr check
* pre-commit chain run
---------
Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com >
2025-03-25 12:18:44 +01:00
Hoang-Long Do
6df8827231
fix(debug): Missing translation of bbox to to_bounding_box ( #1220 )
...
* Fix: Add missing bbox attribute to PdfTextCell
* Fix: Add missing bbox attribute to PdfTextCell
Signed-off-by: hl2311 <dhlong2301@gmail.com >
* fix: Refactor missing bbox attribute to PdfTextCell
Signed-off-by: hl2311 <dhlong2301@gmail.com >
* Signed-off-by: hl2311 <dhlong2301@gmail.com >
fix: Refactor missing bbox attribute to PdfTextCell
---------
Signed-off-by: hl2311 <dhlong2301@gmail.com >
2025-03-25 12:18:10 +01:00
Rafael Teixeira de Lima
f739d0e4c5
fix(docx): identifying numbered headers ( #1231 )
...
* Modifications to identify numbered headers
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add style check
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-25 11:41:02 +01:00
Maxim Lysak
1c26769785
feat(SmolDocling): Support MLX acceleration in VLM pipeline ( #1199 )
...
* Initial implementation to support MLX for VLM pipeline and SmolDocling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* mlx_model unit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Add CLI choices for VLM pipeline and model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Initial implementation to support MLX for VLM pipeline and SmolDocling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* mlx_model unit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Add CLI choices for VLM pipeline and model
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Updated minimal vlm pipeline example
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* make vlm_pipeline python3.9 compatible
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed extract_text_from_backend definition
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated README
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated example
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated documentation
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* corrections in the documentation
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Consmetic changes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 15:38:54 +01:00
Maciej Wieczorek
b454aa1551
feat: Add PPTX notes slides ( #474 )
...
* feat: Add PPTX notes slides
Presenter notes may have useful information and should also be extracted.
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
* feat: Move presenter notes into furniture
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
---------
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724
fix: Determine correct page size in DoclingParseV4Backend ( #1196 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 11:05:42 +01:00
Rafael Teixeira de Lima
0b707d0882
fix(msword): Fixing function return in equations handling ( #1194 )
...
* Fixing function return
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add message
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-19 10:34:25 +01:00
Maxim Lysak
2f72167ff6
feat: updated vlm pipeline (with latest changes from docling-core) ( #1158 )
...
* Draft implementation of Doctag backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated VLM pipeline doctags to docling conversion, now properly supports lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* preparing to migrate to new doctags deserializer
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* re-using DocTagsDocument.from_doctags_and_image_pairs
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* satisfying mypy and other checks
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added support for force_backend_text parameter
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed unnecessary transformation
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Cleaned up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Updated readme
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-18 15:44:51 +01:00
Michele Dolfi
6eaae3cba0
feat: add factory for ocr engines via plugins ( #1010 )
...
* add factory for ocr engines
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply pre-commit after rebase
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add picture description factory
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix enable option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* switch to create methods
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* make `options` an explicit kwarg
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* keep old lock of docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add allow_external_plugins option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add factory return and ignore options type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-18 13:58:05 +01:00
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API ( #905 )
...
* Add DoclingParseV3 backend implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use docling-core with docling-parse types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes and test updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test units
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add back DoclingParse v1 backend, pipeline options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update locks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Ground-truth files updated
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update tests, use TextCell.from_ocr property
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Text fixes, new test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename docling backend to v4
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Test all backends, fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset all tests to use docling-parse v1 for now
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for DPv4 backend init, better test coverage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* test_input_doc use default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-18 10:38:19 +01:00
Michele Dolfi
fa16b12316
chore: move to docling-project org ( #1160 )
...
* chore: rename org
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update docs/faq/index.md
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
* update github pages
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* revert test content
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-03-14 12:35:29 +01:00
Cesar Berrospi Ramis
f94da44ec5
fix(html): handle nested empty lists ( #1154 )
...
Address the case of nested lists in empty list items.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 16:56:58 +01:00
Panos Vagenas
0945973b79
fix: use first table row as col headers ( #1156 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima
6eb718f849
feat: equations to latex in MSWord backend (with inline groups) ( #1114 )
...
* Equation groups
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Proper handling of orphan IDs in layout postprocessing (#1118 )
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.25.2 [skip ci]
* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124 )
add env var in docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(CLI): fix help message for abort options (#1130 )
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* perf: New revision code formula model and document picture classifier (#1140 )
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: Use new TableFormer model weights and default to accurate model version (#1100 )
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.26.0 [skip ci]
* fix: Pass tests, update docling-core to 2.22.0 (#1150 )
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Updating content hash
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
aa92a57fa9
fix: Pass tests, update docling-core to 2.22.0 ( #1150 )
...
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 09:45:55 +01:00
Christoph Auer
eb97357b05
feat: Use new TableFormer model weights and default to accurate model version ( #1100 )
...
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-03-11 10:53:49 +01:00
Matteo
5e30381c0d
perf: New revision code formula model and document picture classifier ( #1140 )
...
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
2025-03-11 10:15:28 +01:00
Michele Dolfi
4d64c4c0b6
fix(CLI): fix help message for abort options ( #1130 )
...
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-03-07 14:47:49 +01:00
Christoph Auer
c56ab3a66b
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
...
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-05 14:30:59 +01:00
Michele Dolfi
8dc0562542
fix: enable locks for threadsafe pdfium ( #1052 )
...
* enable locks for threadsafe pdfium
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix deadlock in pypdfium2 backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-03-02 20:06:44 +01:00
Peter W. J. Staar
e25d557c06
refactor: add the contentlayer to html-backend ( #1040 )
...
* added the contentlayer to html-backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the handle_image function
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code of html backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* test(html): add more info if a test case fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor(html): put parsed item in body if doc has no header
In case an HTML does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth accoring to the changes above.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: set TextItem label to 'text' instead of 'paragraph'
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis
de7b963b09
fix(html): use 'start' attribute when parsing ordered lists from HTML docs ( #1062 )
...
* fix(html): use 'start' attribute in ordered lists
When parsing ordered lists in HTML, take into account the 'start' attribute if it exists.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(html): reduce verbosity in HTML backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-27 09:46:57 +01:00