Docling

Author	SHA1	Message	Date
Christoph Auer	f9144f2bb6	docs: Add example for inspection of picture content (#624 ) * chore: Add example for inspection of picture content Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Test case re-generation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Test case re-generation only on CPU Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Add missing GT files Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-29 10:39:00 +01:00
github-actions[bot]	4d11d87d06	chore: bump version to 2.17.0 [skip ci]	2025-01-28 18:37:26 +00:00
Panos Vagenas	5aed9f8aeb	fix: fix single newline handling in MD backend (#824 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis	adf6353483	fix: use file extension if filetype fails with PDF (#827 ) Filetype library may not identify some files as PDF. Leverage the file extension as a simple solution. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-28 19:03:54 +01:00
Panos Vagenas	ba521dd88f	chore: add missing imports to Office type tests (#826 ) * chore: add missing import to XLSX test Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update test_backend_msword.py [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update test_backend_pptx.py Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 16:17:44 +01:00
Panos Vagenas	6875913e34	docs: document Docling JSON parsing (#819 ) * docs: document Docling JSON parsing Also: - factored out and expanded supported formats - reorged feature list Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update feature list, minor fixes Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 13:23:30 +01:00
Anastas Stoyanovsky	5139b48e4e	docs: Add SSL verification error mitigation (#821 ) Add SSL verification error mitigation Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>	2025-01-28 07:22:43 +01:00
Michele Dolfi	6882e6c38d	feat(CLI): Expose code and formula models in the CLI (#820 ) feat: expose code and formula models in the CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-28 06:26:03 +01:00
Cesar Berrospi Ramis	4d41db3f7a	docs(backend XML): do not delete temp file in notebook (#817 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-27 18:53:39 +01:00
Cesar Berrospi Ramis	a112d7a035	fix: parse html with omitted body tag (#818 ) * fix: parse HTML files without body tag Parse HTML files without 'body' tag, since it is optional in HTML5 specification. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: ensure docling converts HTML without body tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-27 16:59:00 +01:00
Panos Vagenas	95b293a723	feat: add platform info to CLI version printout (#816 ) * feat: add platform info to CLI version printout Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update main.py Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * add Python implementation & language versions Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-27 16:04:57 +01:00
Yorick Terweijden	53327552e8	feat(ocr): expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries (#786 ) * Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries - Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries. - Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters). Signed-off-by: Yorick Terweijden <yorick@spread.ai> * style(rapidocr-options): fix alignment of `rec_keys_path` comment Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made. Signed-off-by: Yorick Terweijden <yorick@spread.ai> --------- Signed-off-by: Yorick Terweijden <yorick@spread.ai>	2025-01-27 13:38:15 +01:00
Michele Dolfi	9022c6d855	chore: update deps in lockfile (#815 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-27 12:41:18 +01:00
Farzad Sunavala	8a4ec77576	docs: typo (#814 ) * Update rag_azuresearch.ipynb Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> * typo Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> --------- Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>	2025-01-27 11:24:26 +01:00
Farzad Sunavala	b885b2fa3c	docs: added markdown headings to enable TOC in github pages (#808 ) * docs: added markdown headings to enable TOC in github pages Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> * minor renames Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> * part 3 heading Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> --------- Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>	2025-01-27 09:40:35 +01:00
Cesar Berrospi Ramis	c2ae1cc4ca	docs: description of supported formats and backends (#788 ) * chore: remove type-ignore marks for attaching text to non GroupItems After commit b74208 of docling-core, text items can be attached to any NodeItem and therefore the ignore[arg-type] type marks can be removed. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: remove unnecessary imports Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add documentation on supported formats and backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add notebook example with XML backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-26 08:10:33 +01:00
Nikos Livathinos	3be2fb581f	feat: Introduce automatic language detection in TesseractOcrCliModel (#800 ) * feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * docs: Add example how to use "auto" language with tesseract OCR engines Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected language is installed in the system and if not fall back to a default option without language. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-01-26 08:07:56 +01:00
github-actions[bot]	9e4ca90db1	chore: bump version to 2.16.0 [skip ci]	2025-01-24 18:21:14 +00:00
Peter W. J. Staar	a458e298ca	fix: added extraction of byte-images in excel (#804 ) * fix(msexcel): ignore Mypy checking for _find_images_in_sheet function Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> * fixed some issues Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * pinned pillow in pyproject Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-24 18:48:02 +01:00
Matteo	16a218d871	feat: New document picture classifier (#805 ) * figure classifier Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * gt for e2e tests Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * tests Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> --------- Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>	2025-01-24 18:05:51 +01:00
Panos Vagenas	88a0e66adc	feat: add Docling JSON ingestion (#783 ) * feat: add Docling JSON ingestion Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update docling/backend/json/docling_json_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-24 18:05:23 +01:00
Yusik Kim	e9768ae6a5	chore: expose draw_clusters function (#803 ) feat: expose draw_clusters function add type annotations to function signature Signed-off-by: Yusik Kim <kmyusk@gmail.com>	2025-01-24 17:35:29 +01:00
Matteo	3213b247ad	feat: Code and equation model for PDF and code blocks in markdown (#752 ) * propagated changes for new CodeItem class Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Rebased branch on latest main. changes for CodeItem Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused files Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * chore: update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * pin latest docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docling-core pinning Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use new add_code in backends and update typing in MD backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * added if statement for backend Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused import Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed print statements Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * gt for new pdf Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Update docling/pipeline/standard_pdf_pipeline.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> * fixed doc comment of __call__ function of code_formula_model Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * fix artifacts_path type Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move expansion_factor to base class Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-24 16:54:22 +01:00
Farzad Sunavala	c58f75d0f7	docs: fix minor typos (#801 ) Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>	2025-01-24 16:27:05 +01:00
Farzad Sunavala	9020a934be	docs: add Azure RAG example (#675 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Farzad Sunavala <fsunavala@microsoft.com>	2025-01-24 13:56:26 +01:00
Pavel Denisov	8543c22687	feat: add "auto" language for TesseractOcr (#759 ) * Add "auto" language for TesseractOcr Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Add tesseract-ocr-script-latn installation for the "auto" language Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Modify "auto" language in TesseractOcr to initialize the script readers lazily Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Finalize script readers Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Fix script models prefix for Linux Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> --------- Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>	2025-01-23 12:40:50 +01:00
Michele Dolfi	c49b3526fb	docs: fix links between docs pages (#697 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-20 09:52:59 +01:00
Selvam Palanimalai	e4c7210133	ci: added action to generate llms.txt (#701 ) * ci: added action in docs.yml to generate llms.txt Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com> * ci: pinning llms-txt action version as per PR feedback Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com> --------- Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com>	2025-01-20 09:52:27 +01:00
Christoph Auer	670a08bded	fix: Update docling-parse-v2 backend version with new parsing fixes (#769 ) * chore: Update lockfile with docling-parse git branch Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Final docling-parse pinning Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-20 09:00:57 +01:00
Iacopo Ghinassi	768608351d	docs: fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733 ) * Update custom_convert.py Added the missing AcceleratorDevice and AcceleratorOptions functions in the imports and changed Device in the code to the correct AcceleratorDevice Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com> * apply formatting Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-19 16:55:26 +01:00
Michele Dolfi	57fc28d3d8	refactor: allow the usage of backends in the enrich models and generalize the interface (#742 ) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 09:52:38 +01:00
Peter W. J. Staar	f7e1cbf629	docs: Example to translate documents (#739 ) * added example to translate documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the mkdocs Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fix PR hooks Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 06:51:15 +01:00
github-actions[bot]	1976584be1	chore: bump version to 2.15.1 [skip ci]	2025-01-10 10:29:32 +00:00
Christoph Auer	5a060f237d	fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719 ) fix: Properly care for all bitmap elements in OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-10 10:38:49 +01:00
Panos Vagenas	9a6b5c8c8d	docs: add pointers to LangChain-side docs (#718 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-09 17:36:46 +01:00
Panos Vagenas	4fa8028bd8	docs: add LangChain docs (#717 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-09 14:12:05 +01:00
Michele Dolfi	e64b5a2f62	fix: allow earlier requests versions (#716 ) allow earlier requests versions Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-09 13:30:40 +01:00
github-actions[bot]	9a94b54f6c	chore: bump version to 2.15.0 [skip ci]	2025-01-08 12:06:38 +00:00
Christoph Auer	5cb4cf6f19	fix: Correct scaling of debug visualizations, tune OCR (#700 ) * fix: Correct scaling of debug visualizations, tune OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: remove unused imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: Update docling-core Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-08 12:26:44 +01:00
Michele Dolfi	ead396ab40	docs: specify docstring types (#702 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-08 09:05:18 +01:00
Michele Dolfi	6701f34c85	docs: add link to rag with granite (#698 ) * docs: add link to rag with granite Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update mkdocs.yml Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 20:01:41 +01:00
Christoph Auer	42856fdf79	fix: Let BeautifulSoup detect the HTML encoding (#695 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-07 15:49:28 +01:00
Panos Vagenas	2d24faecd9	docs: add integrations, revamp docs (#693 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 14:15:54 +01:00
Jinfeng Sun	d49650c54f	fix(mspowerpoint): handle invalid images in PowerPoint slides (#650 ) - Add error handling for images that cannot be loaded by Pillow - Improve resilience when encountering corrupted or unsupported image formats - Maintain processing of other slide elements even if an image fails to load Signed-off-by: Tendo33 <sjf1998112@gmail.com>	2025-01-07 13:58:10 +01:00
Luke Harrison	0ee849e8bc	feat: added http header support for document converter and cli (#642 ) * added http header support for document converter and cli Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * fixed formatting and typing issues Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * use pydantic to parse dict suggested by @dolfim-ibm Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> --------- Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-07 10:15:14 +01:00
JSIV	569038df42	docs: Add OpenContracts as an integration (#679 ) * Add OpenContracts as an open source project OpenContracts now offers Docling as a document ingestion and parsing pipeline Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> * Update mkdocs.yml Added OpenContracts to the nav configs Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> --------- Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>	2025-01-07 10:14:42 +01:00
m-newhauser	2b591f9872	docs: add Weaviate RAG recipe notebook (#451 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 21:57:40 +01:00
Panos Vagenas	fc645ea531	docs: document Haystack & Vectara support (#628 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 13:33:02 +01:00
github-actions[bot]	1418fa1488	chore: bump version to 2.14.0 [skip ci]	2024-12-18 07:04:47 +00:00
Lucas Morin	fd034802b6	feat: Create a backend to transform PubMed XML files to DoclingDocument (#557 ) Signed-off-by: lucas-morin <lucas.morin222@gmail.com>	2024-12-17 19:27:09 +01:00

372 Commits