Václav Vančura
14d4f5b109
fix(integration): update the Apify Actor integration (#1619)
* fix(actor): remove references to missing docling_processor.py
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): update Actor README.md with recent repo URL changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): improve the Actor README.md local header link
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): bump the Actor version number
Signed-off-by: Václav Vančura <commit@vancura.dev>
* Update .actor/actor.json
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
---------
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829
chore: bump version to 2.33.0 [skip ci]
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions (#1609)
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>
2025-05-20 19:42:37 +02:00
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage (#1617)
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371)
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)
get merged_text from the bounding box instead of merging the text, to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend (#1538)
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com>
---------
Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Panos Vagenas
7c4c356e76
chore: fix chunking example data link (#1596)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 08:44:47 +02:00
github-actions[bot]
aeb0716bbb
chore: bump version to 2.32.0 [skip ci]
2025-05-14 14:28:21 +00:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls (#1548)
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str (#1553)
fix: ensure orig and text are both strings in TesseractOcrCliModel
Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com>
2025-05-14 15:05:52 +02:00
Panos Vagenas
9f28abf061
docs: add advanced chunking & serialization example (#1589)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-14 14:35:07 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables (#1551)
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com>
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type (#1415)
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com>
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com>
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com>
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com>
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-14 09:47:28 +02:00
github-actions[bot]
23238c241f
chore: bump version to 2.31.2 [skip ci]
2025-05-13 10:09:19 +00:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification (#1562) (#1563)
Fix the regular expression that identifies header lines in AsciiDoc so that
it does not match defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file (#1582)
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-13 10:40:08 +02:00
github-actions[bot]
0d0fa6cbe3
chore: bump version to 2.31.1 [skip ci]
2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils (#1577)
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-12 10:48:07 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit (#1559)
Update data_prep_kit.md
The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.
Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com>
2025-05-11 20:38:25 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows (#1536)
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-09 15:14:32 +02:00
Panos Vagenas
3220a592e7
docs: add serialization docs, update chunking docs (#1556)
* docs: add serializers docs, update chunking docs
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update notebook to improve MD table rendering
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-08 21:43:01 +02:00
DavidLee
f1658edbad
fix: mime error in document streams (#1523)
Update document.py
fix the error raised when getting the file MIME type
Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS (#1512)
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-02 15:03:29 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison (#1511)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-02 10:52:18 +02:00
Ihar Hrachyshka
b147331f2a
chore: restore typing hint for self.script_readers (#1500)
With `from __future__ import annotations`, type hint resolution is always deferred.
https://peps.python.org/pep-0563/
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-30 20:33:27 +02:00
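A small sketch of the deferral mentioned in the commit above, assuming Python 3.7 or later. The names `SOURCE`, `use_reader`, and `NotImportedReader` are hypothetical and do not come from the repository; the source is compiled from a string only so that the `__future__` import sits at the top of its own module.

```python
# Module source with the future import at the top, as the language requires.
SOURCE = '''
from __future__ import annotations

def use_reader(reader: NotImportedReader) -> None:
    """NotImportedReader is never defined, yet the definition succeeds."""
'''

namespace: dict = {}
exec(compile(SOURCE, "<sketch>", "exec"), namespace)

# Annotations are stored as plain strings and are not evaluated at definition time.
annotations = namespace["use_reader"].__annotations__
print(annotations)
```

Because the hint is kept as a string, a type that cannot be imported at definition time can still be named in the annotation.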
Ben Browning
4ab7e9ddfb
fix: Guard against attribute errors in TesseractOcrModel __del__ (#1494)
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.
This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.
This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-30 17:51:33 +02:00
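The guard described in the commit above can be sketched as follows. This is a minimal illustration, not the actual docling code: the class name `OcrModelSketch`, the `enabled` flag, and the module name `_missing_ocr_backend_for_sketch` are hypothetical, and the `End()` calls assume reader objects that expose such a method.

```python
from typing import Any, Dict, Optional


class OcrModelSketch:
    """Hypothetical stand-in for a model that wraps an optional dependency."""

    def __init__(self, enabled: bool = True) -> None:
        # Set the attributes first, so they exist even if the import below fails.
        self.reader: Optional[Any] = None
        # The value type is Any because the real type lives in the optional module.
        self.script_readers: Dict[str, Any] = {}

        if enabled:
            try:
                # Hypothetical module name, chosen so the import fails in this sketch.
                import _missing_ocr_backend_for_sketch  # noqa: F401
            except ImportError as exc:
                raise ImportError("OCR backend is not installed") from exc

    def __del__(self) -> None:
        # Safe for a partially initialized instance: both attributes always exist.
        if self.reader is not None:
            self.reader.End()
        for script_reader in self.script_readers.values():
            script_reader.End()
```

If the attribute assignments came after the failing import, `__del__` would hit an `AttributeError` when the half-built instance is garbage collected.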
Zach Cox
cc453961a9
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496)
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel
Signed-off-by: Zach Cox <zach.s.cox@gmail.com>
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289
fix: updated the time-recorder label for reading order (#1490)
* fix: updated the time-recorder label for reading order
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-04-29 13:02:53 +02:00
Michele Dolfi
d8959c6b19
chore: update dependencies in lock file (#1458)
update lock: h11 vuln and torch update
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-28 08:52:46 +02:00
nkh0472
a097ccd8d5
chore: typo fix (#1465)
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
2025-04-28 08:52:09 +02:00
Emmanuel Ferdman
3afbe6c969
docs: update supported formats guide (#1463)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-04-28 08:51:54 +02:00
Maxim Lysak
94d66a0765
fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459)
fix double scaling when do_cell_matching is False
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-04-25 12:34:12 +02:00
github-actions[bot]
c67133dde4
chore: bump version to 2.31.0 [skip ci]
2025-04-25 08:28:25 +00:00
Ryan Lin
a2fbbba9f7
feat: add tutorial using Milvus and Docling for RAG pipeline (#1449)
* feat: add milvus rag with docling tutorial
Signed-off-by: Ryan Lin <linjinhong@yandex.com>
* chore: run pre-commit
Signed-off-by: Ryan Lin <linjinhong@yandex.com>
* feat: add RAG with Milvus example to mkdocs
Signed-off-by: Ryan Lin <linjinhong@yandex.com>
---------
Signed-off-by: Ryan Lin <linjinhong@yandex.com>
2025-04-25 09:12:35 +02:00
Michele Dolfi
976431ed7f
chore: update locked deps (#1442)
update deps
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-23 14:59:31 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags (#1436)
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-23 09:30:59 +02:00
nkh0472
c2470ed216
docs: Fix wrong output format in example code (#1427)
fix: wrong output format
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>
2025-04-22 12:32:55 +02:00
Michele Dolfi
64918a81ac
docs: Add OpenSSF Best Practices badge (#1430)
* docs: add openssf badge
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add badge to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-22 11:23:28 +02:00
Ben Cox
995b3b0ab1
docs: Typo fixes in docling_document.md (#1400)
Signed-off-by: Ben Cox <1038350+ind1go@users.noreply.github.com>
2025-04-22 08:49:08 +02:00
Eugene
8012a3e4d6
fix: Treat overflowing -v flags as DEBUG (#1419)
Signed-off-by: Eugene <fogaprod@gmail.com>
2025-04-19 11:02:41 +02:00
Leandro Rosas
88948b0bba
docs: Updated the [Usage] link in architecture.md (#1416)
Fixed the [Usage] link in architecture.md
Changed the usage link in the tip box from "../usage.md#adjust-pipeline-features" to "../usage/index.md#adjust-pipeline-features" as the previous link is not valid.
Signed-off-by: Leandro Rosas <36343022+leandrosas101@users.noreply.github.com>
2025-04-19 10:20:52 +02:00
Cesar Berrospi Ramis
fa7fc9e63d
fix(codecov): fix codecov argument and yaml file (#1399)
* fix(codecov): fix codecov argument and yaml file
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* ci: set the codecov status to success even if the CI fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-15 18:12:57 +02:00
Panos Vagenas
550b1ca2f8
chore: propagate docling-core fix (#1389)
* chore: propagate docling-core fix
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update lock to latest docling-core release
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-04-15 10:51:47 +02:00
Felix Dittrich
a7dd59c5cb
docs(ocr): Add docs entry for OnnxTR OCR plugin (#1382)
feat(ocr): Add docs entry for OnnxTR OCR plugin
Signed-off-by: felix <felixdittrich92@gmail.com>
2025-04-15 09:46:59 +02:00
Michele Dolfi
06227e9970
ci: sign pypi packages (#1392)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-15 08:59:16 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff (#1383)
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-14 18:01:26 +02:00
Michele Dolfi
293c28ca7c
docs(security): more statements about secure development (#1381)
docs: more statements about secure development
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 13:53:26 +02:00
Michele Dolfi
01fbfd5652
docs: Add testing in the docs (#1379)
* add testing to CONTRIBUTING
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* document test generation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* typo
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 12:31:48 +02:00
Michele Dolfi
d9c3999175
chore: update lock file (#1378)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-14 10:38:10 +02:00