Commit Graph

489 Commits

Author SHA1 Message Date
Panos Vagenas
9f28abf061
docs: add advanced chunking & serialization example (#1589)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-14 14:35:07 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables (#1551)
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com>
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type (#1415)
* support image/webp file type

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>

* docs: add webp image format in supported_formats.md

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>

* test: add a test case for `image/webp` file

Signed-off-by: Elwin <hzywong@gmail.com>

* style: apply styling

Signed-off-by: Elwin <hzywong@gmail.com>

* test: update test case of converting `image/webp` file with more ocr engines

Signed-off-by: Elwin <hzywong@gmail.com>

* style: apply styling

Signed-off-by: Elwin <hzywong@gmail.com>

* rename test file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-14 09:47:28 +02:00
github-actions[bot]
23238c241f chore: bump version to 2.31.2 [skip ci] 2025-05-13 10:09:19 +00:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification (#1562) (#1563)
Fix regular expression to identify header lines in AsciiDoc avoiding to
match defined blocks.

Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file (#1582)
* fix click dependency and update lock file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-13 10:40:08 +02:00
github-actions[bot]
0d0fa6cbe3 chore: bump version to 2.31.1 [skip ci] 2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils (#1577)
add smoldocling in download utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-12 10:48:07 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit (#1559)
Update data_prep_kit.md

The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.

Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com>
2025-05-11 20:38:25 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows (#1536)
* chore(HTML): log the stacktrace of errors

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(HTML): handle row headers like in pivot tables

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-09 15:14:32 +02:00
Panos Vagenas
3220a592e7
docs: add serialization docs, update chunking docs (#1556)
* docs: add serializers docs, update chunking docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update notebook to improve MD table rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-08 21:43:01 +02:00
DavidLee
f1658edbad
fix: mime error in document streams (#1523)
Update document.py

edit got file mime error

Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS (#1512)
fix usage of hashlib for FIPS

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-02 15:03:29 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison (#1511)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-02 10:52:18 +02:00
Ihar Hrachyshka
b147331f2a
chore: restore typing hint for self.script_readers (#1500)
With future annotations, typing hints resolution is always deferred.

https://peps.python.org/pep-0563/

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-30 20:33:27 +02:00
Ben Browning
4ab7e9ddfb
fix: Guard against attribute errors in TesseractOcrModel __del__ (#1494)
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.

This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.

This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-30 17:51:33 +02:00
Zach Cox
cc453961a9
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496)
fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel

Signed-off-by: Zach Cox <zach.s.cox@gmail.com>
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289
fix: updated the time-recorder label for reading order (#1490)
* fix: updated the time-recorder label for reading order

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-04-29 13:02:53 +02:00
Michele Dolfi
d8959c6b19
chore: update dependencies in lock file (#1458)
update lock: h11 vuln and torch update

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-28 08:52:46 +02:00
nkh0472
a097ccd8d5
chore: typo fix (#1465)
* typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

* chore: typo fix

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>

---------

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
2025-04-28 08:52:09 +02:00
Emmanuel Ferdman
3afbe6c969
docs: update supported formats guide (#1463)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-04-28 08:51:54 +02:00
Maxim Lysak
94d66a0765
fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459)
fixing double scaling in case of do_cell_matching is False

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-25 12:34:12 +02:00
github-actions[bot]
c67133dde4 chore: bump version to 2.31.0 [skip ci] 2025-04-25 08:28:25 +00:00
Ryan Lin
a2fbbba9f7
feat: add tutorial using Milvus and Docling for RAG pipeline (#1449)
* feat: add milvus rag with docling tutorial

Signed-off-by: Ryan Lin <linjinhong@yandex.com>

* chore: run pre-commit

Signed-off-by: Ryan Lin <linjinhong@yandex.com>

* feat: add RAG with Milvus example to mkdocs

Signed-off-by: Ryan Lin <linjinhong@yandex.com>

---------

Signed-off-by: Ryan Lin <linjinhong@yandex.com>
2025-04-25 09:12:35 +02:00
Michele Dolfi
976431ed7f
chore: update locked deps (#1442)
update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-23 14:59:31 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags (#1436)
* fix(html): handle 'address' tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(html): handle 'details' tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-23 09:30:59 +02:00
nkh0472
c2470ed216
docs: Fix wrong output format in example code (#1427)
fix: wrong output format

Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
2025-04-22 12:32:55 +02:00
Michele Dolfi
64918a81ac
docs: Add OpenSSF Best Practices badge (#1430)
* docs: add openssf badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add badge to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-22 11:23:28 +02:00
Ben Cox
995b3b0ab1
docs: Typo fixes in docling_document.md (#1400)
Signed-off-by: Ben Cox <1038350+ind1go@users.noreply.github.com>
2025-04-22 08:49:08 +02:00
Eugene
8012a3e4d6
fix: Treat overflowing -v flags as DEBUG (#1419)
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-19 11:02:41 +02:00
Leandro Rosas
88948b0bba
docs: Updated the [Usage] link in architecture.md (#1416)
Fixed the [Usage] link in architecture.md

Changed the usage link in the tip box from "../usage.md#adjust-pipeline-features" to "../usage/index.md#adjust-pipeline-features"  as the previous link is not valid.

Signed-off-by: Leandro Rosas <36343022+leandrosas101@users.noreply.github.com>
2025-04-19 10:20:52 +02:00
Cesar Berrospi Ramis
fa7fc9e63d
fix(codecov): fix codecov argument and yaml file (#1399)
* fix(codecov): fix codecov argument and yaml file

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* ci: set the codecov status to success even if the CI fails

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-15 18:12:57 +02:00
Panos Vagenas
550b1ca2f8
chore: propagate docling-core fix (#1389)
* chore: propagate docling-core fix

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update lock to latest docling-core release

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-15 10:51:47 +02:00
Felix Dittrich
a7dd59c5cb
docs(ocr): Add docs entry for OnnxTR OCR plugin (#1382)
feat(ocr): Add docs entry for OnnxTR OCR plugin

Signed-off-by: felix <felixdittrich92@gmail.com>
2025-04-15 09:46:59 +02:00
Michele Dolfi
06227e9970
ci: sign pypi packages (#1392)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-15 08:59:16 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff (#1383)
* add coverage calculation and push

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* new codecov version and usage of token

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enable ruff formatter instead of black and isort

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff lint fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply ruff unsafe fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add removed imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* runs 1 on linter issues

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* finalize linter fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update pyproject.toml

Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-14 18:01:26 +02:00
Michele Dolfi
293c28ca7c
docs(security): more statements about secure development (#1381)
docs: more statement about secure development

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 13:53:26 +02:00
Michele Dolfi
01fbfd5652
docs: Add testing in the docs (#1379)
* add testing to CONTRIBUTING

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* document test generation

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 12:31:48 +02:00
Michele Dolfi
d9c3999175
chore: update lock file (#1378)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 10:38:10 +02:00
Juil Park
a026b4e84b
docs: Add Notes for Installing in Intel macOS (#1377)
docs: Add Notes for Intel macOS

Signed-off-by: Juil Park <park@juil.dev>
2025-04-14 10:21:13 +02:00
github-actions[bot]
c391adb5f0 chore: bump version to 2.30.0 [skip ci] 2025-04-14 08:20:31 +00:00
Michele Dolfi
7e40ad3261
fix(deps): widen typer upper bound (#1375)
bump typer

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 09:23:39 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode (#1355)
* updated the cli to output html in split-page mode

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* add pin for new docling-core with html split argument

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* relock with fixed html export in docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update more tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update lock with docling-core fixes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add again chunking extras

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991
fix: auto-recognize .xlsx, .docx and .pptx files (#1340)
* bug: auto-recognize .xlsx files

Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com>

* apply styling

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply to other ms office zip formats

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 07:45:13 +02:00
Simon Leiß
b295da4bfe
chore: Update repository URL in CITATION.cff (#1363)
Update repository URL in CITATION.cff

Repository was moved to docling-project/docling, so adjust the URL.

Signed-off-by: Simon Leiß <5084100+sleiss@users.noreply.github.com>
2025-04-14 06:57:04 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures (#1359)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d
fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248)
fix: Implement PictureDescriptionApiOptions.picture_area_threshold

Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com>
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend (#1332)
* sytle(xlsx): enforce type hints in XLSX backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xlsx): create a page for each worksheet in XLSX backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docs(xlsx): add docstrings to XLSX backend module.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* docling(xlsx): add bounding boxes and page size information in cell units

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9
feat: OllamaVlmModel for Granite Vision 3.2 (#1337)
* build: Add ollama sdk dependency

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add option plumbing for OllamaVlmOptions in pipeline_options

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Full implementation of OllamaVlmModel

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Connect "granite_vision_ollama" pipeline option to CLI

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "build: Add ollama sdk dependency"

After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.

This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move OpenAI API call logic into utils.utils

This will allow reuse of this logic in a generic VLM model

NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor from Ollama SDK to generic OpenAI API

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Linting, formatting, and bug fixes

The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.

Branch: OllamaVlmModel

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* remove model from download enum

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* generalize input args for other API providers

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename and refactor

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* require flag for remote services

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* disable example from CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add examples to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a
fix: Properly address page in pipeline _assemble_document when page_range is provided (#1334)
* Fixes #1333

Signed-off-by: Joan Fabrégat <j@fabreg.at>

* fix for the (dumb) MyPy type checker

Signed-off-by: Joan Fabrégat <j@fabreg.at>

---------

Signed-off-by: Joan Fabrégat <j@fabreg.at>
2025-04-10 16:11:28 +02:00