github-actions[bot]
4d11d87d06
chore: bump version to 2.17.0 [skip ci]
2025-01-28 18:37:26 +00:00
Panos Vagenas
5aed9f8aeb
fix: fix single newline handling in MD backend ( #824 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis
adf6353483
fix: use file extension if filetype fails with PDF ( #827 )
...
Filetype library may not identify some files as PDF. Leverage the file extension
as a simple solution.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-28 19:03:54 +01:00
Panos Vagenas
ba521dd88f
chore: add missing imports to Office type tests ( #826 )
...
* chore: add missing import to XLSX test
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* Update test_backend_msword.py [skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* Update test_backend_pptx.py
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-28 16:17:44 +01:00
Panos Vagenas
6875913e34
docs: document Docling JSON parsing ( #819 )
...
* docs: document Docling JSON parsing
Also:
- factored out and expanded supported formats
- reorged feature list
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* update feature list, minor fixes
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-28 13:23:30 +01:00
Anastas Stoyanovsky
5139b48e4e
docs: Add SSL verification error mitigation ( #821 )
...
Add SSL verification error mitigation
Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>
2025-01-28 07:22:43 +01:00
Michele Dolfi
6882e6c38d
feat(CLI): Expose code and formula models in the CLI ( #820 )
...
feat: expose code and formula models in the CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-28 06:26:03 +01:00
Cesar Berrospi Ramis
4d41db3f7a
docs(backend XML): do not delete temp file in notebook ( #817 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-27 18:53:39 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag ( #818 )
...
* fix: parse HTML files without body tag
Parse HTML files without 'body' tag, since it is optional in HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-27 16:59:00 +01:00
Panos Vagenas
95b293a723
feat: add platform info to CLI version printout ( #816 )
...
* feat: add platform info to CLI version printout
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* Update main.py
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* add Python implementation & language versions
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-27 16:04:57 +01:00
Yorick Terweijden
53327552e8
feat(ocr): expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries ( #786 )
...
* Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries
- Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries.
- Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters).
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* style(rapidocr-options): fix alignment of `rec_keys_path` comment
Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made.
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
---------
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
2025-01-27 13:38:15 +01:00
Michele Dolfi
9022c6d855
chore: update deps in lockfile ( #815 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-27 12:41:18 +01:00
Farzad Sunavala
8a4ec77576
docs: typo ( #814 )
...
* Update rag_azuresearch.ipynb
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
* typo
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
---------
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
2025-01-27 11:24:26 +01:00
Farzad Sunavala
b885b2fa3c
docs: added markdown headings to enable TOC in github pages ( #808 )
...
* docs: added markdown headings to enable TOC in github pages
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
* minor renames
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
* part 3 heading
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
---------
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
2025-01-27 09:40:35 +01:00
Cesar Berrospi Ramis
c2ae1cc4ca
docs: description of supported formats and backends ( #788 )
...
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-26 08:10:33 +01:00
Nikos Livathinos
3be2fb581f
feat: Introduce automatic language detection in TesseractOcrCliModel ( #800 )
...
* feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* docs: Add example how to use "auto" language with tesseract OCR engines
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected
language is installed in the system and if not fall back to a default option without language.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-01-26 08:07:56 +01:00
github-actions[bot]
9e4ca90db1
chore: bump version to 2.16.0 [skip ci]
2025-01-24 18:21:14 +00:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel ( #804 )
...
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>
* fixed some issues
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* pinned pillow in pyproject
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-24 18:48:02 +01:00
Matteo
16a218d871
feat: New document picture classifier ( #805 )
...
* figure classifier
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* gt for e2e tests
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* tests
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
---------
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
2025-01-24 18:05:51 +01:00
Panos Vagenas
88a0e66adc
feat: add Docling JSON ingestion ( #783 )
...
* feat: add Docling JSON ingestion
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* Update docling/backend/json/docling_json_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-24 18:05:23 +01:00
Yusik Kim
e9768ae6a5
chore: expose draw_clusters function ( #803 )
...
feat: expose draw_clusters function
add type annotations to function signature
Signed-off-by: Yusik Kim <kmyusk@gmail.com>
2025-01-24 17:35:29 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown ( #752 )
...
* propagated changes for new CodeItem class
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* Rebased branch on latest main. changes for CodeItem
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed unused files
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* chore: update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* pin latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docling-core pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use new add_code in backends and update typing in MD backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* added if statement for backend
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed unused import
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed print statements
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* gt for new pdf
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* Update docling/pipeline/standard_pdf_pipeline.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
* fixed doc comment of __call__ function of code_formula_model
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* fix artifacts_path type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move expansion_factor to base class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-24 16:54:22 +01:00
Farzad Sunavala
c58f75d0f7
docs: fix minor typos ( #801 )
...
Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>
2025-01-24 16:27:05 +01:00
Farzad Sunavala
9020a934be
docs: add Azure RAG example ( #675 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Farzad Sunavala <fsunavala@microsoft.com>
2025-01-24 13:56:26 +01:00
Pavel Denisov
8543c22687
feat: add "auto" language for TesseractOcr ( #759 )
...
* Add "auto" language for TesseractOcr
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Add tesseract-ocr-script-latn installation for the "auto" language
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Modify "auto" language in TesseractOcr to initialize the script readers lazily
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Finalize script readers
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Fix script models prefix for Linux
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
---------
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
2025-01-23 12:40:50 +01:00
Michele Dolfi
c49b3526fb
docs: fix links between docs pages ( #697 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-20 09:52:59 +01:00
Selvam Palanimalai
e4c7210133
ci: added action to generate llms.txt ( #701 )
...
* ci: added action in docs.yml to generate llms.txt
Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com>
* ci: pinning llms-txt action version as per PR feedback
Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com>
---------
Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com>
2025-01-20 09:52:27 +01:00
Christoph Auer
670a08bded
fix: Update docling-parse-v2 backend version with new parsing fixes ( #769 )
...
* chore: Update lockfile with docling-parse git branch
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Final docling-parse pinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-20 09:00:57 +01:00
Iacopo Ghinassi
768608351d
docs: fix correct Accelerator pipeline options in docs/examples/custom_convert.py ( #733 )
...
* Update custom_convert.py
Added the missing AcceleratorDevice and AcceleratorOptions functions in the imports and changed Device in the code to the correct AcceleratorDevice
Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com>
* apply formatting
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-19 16:55:26 +01:00
Michele Dolfi
57fc28d3d8
refactor: allow the usage of backends in the enrich models and generalize the interface ( #742 )
...
* fix get image with cropbox
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* allow the usage of backends in the enrich models and generalize the interface
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move logic in BaseTextImageEnrichmentModel
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-15 09:52:38 +01:00
Peter W. J. Staar
f7e1cbf629
docs: Example to translate documents ( #739 )
...
* added example to translate documents
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the mkdocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix PR hooks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-15 06:51:15 +01:00
github-actions[bot]
1976584be1
chore: bump version to 2.15.1 [skip ci]
2025-01-10 10:29:32 +00:00
Christoph Auer
5a060f237d
fix: Improve OCR results, stricten criteria before dropping bitmap areas ( #719 )
...
fix: Properly care for all bitmap elements in OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-10 10:38:49 +01:00
Panos Vagenas
9a6b5c8c8d
docs: add pointers to LangChain-side docs ( #718 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 17:36:46 +01:00
Panos Vagenas
4fa8028bd8
docs: add LangChain docs ( #717 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 14:12:05 +01:00
Michele Dolfi
e64b5a2f62
fix: allow earlier requests versions ( #716 )
...
allow earlier requests versions
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-09 13:30:40 +01:00
github-actions[bot]
9a94b54f6c
chore: bump version to 2.15.0 [skip ci]
2025-01-08 12:06:38 +00:00
Christoph Auer
5cb4cf6f19
fix: Correct scaling of debug visualizations, tune OCR ( #700 )
...
* fix: Correct scaling of debug visualizations, tune OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: remove unused imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-08 12:26:44 +01:00
Michele Dolfi
ead396ab40
docs: specify docstring types ( #702 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-08 09:05:18 +01:00
Michele Dolfi
6701f34c85
docs: add link to rag with granite ( #698 )
...
* docs: add link to rag with granite
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update mkdocs.yml
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-07 20:01:41 +01:00
Christoph Auer
42856fdf79
fix: Let BeautifulSoup detect the HTML encoding ( #695 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-07 15:49:28 +01:00
Panos Vagenas
2d24faecd9
docs: add integrations, revamp docs ( #693 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-07 14:15:54 +01:00
Jinfeng Sun
d49650c54f
fix(mspowerpoint): handle invalid images in PowerPoint slides ( #650 )
...
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load
Signed-off-by: Tendo33 <sjf1998112@gmail.com>
2025-01-07 13:58:10 +01:00
Luke Harrison
0ee849e8bc
feat: added http header support for document converter and cli ( #642 )
...
* added http header support for document converter and cli
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* fixed formatting and typing issues
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* use pydantic to parse dict
suggested by @dolfim-ibm
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
---------
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-07 10:15:14 +01:00
JSIV
569038df42
docs: Add OpenContracts as an integration ( #679 )
...
* Add OpenContracts as an open source project
OpenContracts now offers Docling as a document ingestion and parsing pipeline
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
* Update mkdocs.yml
Added OpenContracts to the nav configs
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
---------
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
2025-01-07 10:14:42 +01:00
m-newhauser
2b591f9872
docs: add Weaviate RAG recipe notebook ( #451 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-19 21:57:40 +01:00
Panos Vagenas
fc645ea531
docs: document Haystack & Vectara support ( #628 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-19 13:33:02 +01:00
github-actions[bot]
1418fa1488
chore: bump version to 2.14.0 [skip ci]
2024-12-18 07:04:47 +00:00
Lucas Morin
fd034802b6
feat: Create a backend to transform PubMed XML files to DoclingDocument ( #557 )
...
Signed-off-by: lucas-morin <lucas.morin222@gmail.com>
2024-12-17 19:27:09 +01:00
github-actions[bot]
e31f09f71f
chore: bump version to 2.13.0 [skip ci]
2024-12-17 17:01:04 +00:00