Docling

Author	SHA1	Message	Date
Panos Vagenas	9f28abf061	docs: add advanced chunking & serialization example (#1589 ) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-05-14 14:35:07 +02:00
Panos Vagenas	3220a592e7	docs: add serialization docs, update chunking docs (#1556 ) * docs: add serializers docs, update chunking docs Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update notebook to improve MD table rendering Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-05-08 21:43:01 +02:00
Ryan Lin	a2fbbba9f7	feat: add tutorial using Milvus and Docling for RAG pipeline (#1449 ) * feat: add milvus rag with docling tutorial Signed-off-by: Ryan Lin <linjinhong@yandex.com> * chore: run pre-commit Signed-off-by: Ryan Lin <linjinhong@yandex.com> * feat: add RAG with Milvus example to mkdocs Signed-off-by: Ryan Lin <linjinhong@yandex.com> --------- Signed-off-by: Ryan Lin <linjinhong@yandex.com>	2025-04-25 09:12:35 +02:00
Gabe Goodhart	c605edd8e9	feat: OllamaVlmModel for Granite Vision 3.2 (#1337 ) * build: Add ollama sdk dependency Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add option plumbing for OllamaVlmOptions in pipeline_options Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full implementation of OllamaVlmModel Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Connect "granite_vision_ollama" pipeline option to CLI Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Revert "build: Add ollama sdk dependency" After consideration, we're going to use the generic OpenAI API instead of the Ollama-specific API to avoid duplicate work. This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move OpenAI API call logic into utils.utils This will allow reuse of this logic in a generic VLM model NOTE: There is a subtle change here in the ordering of the text prompt and the image in the call to the OpenAI API. When run against Ollama, this ordering makes a big difference. If the prompt comes before the image, the result is terse and not usable whereas the prompt coming after the image works as expected and matches the non-OpenAI chat API. Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor from Ollama SDK to generic OpenAI API Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Linting, formatting, and bug fixes The one bug fix was in the timeout arg to openai_image_request. Otherwise, this is all style changes to get MyPy and black passing cleanly. Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * remove model from download enum Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * generalize input args for other API providers Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename and refactor Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * require flag for remote services Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * disable example from CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add examples to docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-10 18:03:04 +02:00
Michele Dolfi	2e99e5a54f	docs: add plugins docs (#1319 ) add plugin docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-08 09:44:37 +02:00
Panos Vagenas	71148eb381	docs: add visual grounding example (#1270 ) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-04-02 14:03:19 +02:00
Michele Dolfi	54a78c307d	docs: move apify to docs (#1182 ) move apify to docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-03-18 16:43:55 +01:00
Michele Dolfi	fa16b12316	chore: move to docling-project org (#1160 ) * chore: rename org Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update docs/faq/index.md Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * update github pages Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * revert test content Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-03-14 12:35:29 +01:00
Michele Dolfi	357d41cc47	docs: Enrichment models (#1097 ) * warning for develop examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add docs for enrichment models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * minor reorg of top-level docs (#1098) * minor reorg of top-level docs Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fix typo [no ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * trigger ci Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-03-04 14:24:38 +01:00
Panos Vagenas	27c04007bc	docs: revamp picture description example (#1015 ) * docs: revamp picture description example Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * Improvements for visualization example (#1017) * fix colab install, use granite and improve viz of description Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * switch docs to notbook Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * show results with all models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * show other vlm Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-19 11:28:54 +01:00
Tobias Strebitzer	00d9405b0a	feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945 ) * feat: Implement csv backend and format detection Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * test: Implement csv parsing and format tests Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * docs: Add example and CSV format documentation Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * feat: Add support for various CSV dialects and update documentation Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * feat: Add validation for delimiters and tests for inconsistent csv files Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> --------- Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>	2025-02-14 08:55:09 +01:00
Michele Dolfi	2d66e99b69	docs: Examples for picture descriptions (#951 ) * add more examples for picture descriptions Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix merge typo Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-13 08:33:12 +01:00
Nikos Livathinos	6d3fea0196	docs: Introduce example with custom models for RapidOCR (#874 ) * docs: Introduce example with custom models for RapidOCR Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * chore: Exclude the example with custom RapidOCR models from the examples to run in github actions Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-02-04 10:07:00 +01:00
Panos Vagenas	6875913e34	docs: document Docling JSON parsing (#819 ) * docs: document Docling JSON parsing Also: - factored out and expanded supported formats - reorged feature list Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update feature list, minor fixes Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 13:23:30 +01:00
Cesar Berrospi Ramis	c2ae1cc4ca	docs: description of supported formats and backends (#788 ) * chore: remove type-ignore marks for attaching text to non GroupItems After commit b74208 of docling-core, text items can be attached to any NodeItem and therefore the ignore[arg-type] type marks can be removed. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: remove unnecessary imports Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add documentation on supported formats and backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add notebook example with XML backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-26 08:10:33 +01:00
Nikos Livathinos	3be2fb581f	feat: Introduce automatic language detection in TesseractOcrCliModel (#800 ) * feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * docs: Add example how to use "auto" language with tesseract OCR engines Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected language is installed in the system and if not fall back to a default option without language. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-01-26 08:07:56 +01:00
Farzad Sunavala	9020a934be	docs: add Azure RAG example (#675 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Farzad Sunavala <fsunavala@microsoft.com>	2025-01-24 13:56:26 +01:00
Peter W. J. Staar	f7e1cbf629	docs: Example to translate documents (#739 ) * added example to translate documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the mkdocs Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fix PR hooks Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 06:51:15 +01:00
Panos Vagenas	4fa8028bd8	docs: add LangChain docs (#717 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-09 14:12:05 +01:00
Michele Dolfi	6701f34c85	docs: add link to rag with granite (#698 ) * docs: add link to rag with granite Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update mkdocs.yml Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 20:01:41 +01:00
Panos Vagenas	2d24faecd9	docs: add integrations, revamp docs (#693 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 14:15:54 +01:00
JSIV	569038df42	docs: Add OpenContracts as an integration (#679 ) * Add OpenContracts as an open source project OpenContracts now offers Docling as a document ingestion and parsing pipeline Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> * Update mkdocs.yml Added OpenContracts to the nav configs Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> --------- Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>	2025-01-07 10:14:42 +01:00
m-newhauser	2b591f9872	docs: add Weaviate RAG recipe notebook (#451 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 21:57:40 +01:00
Panos Vagenas	fc645ea531	docs: document Haystack & Vectara support (#628 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 13:33:02 +01:00
Panos Vagenas	3e599c7bbe	docs: add Haystack RAG example (#615 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-17 14:24:40 +01:00
Nikos Livathinos	3bb3bf5715	docs: Fix the path to the run_with_accelerator.py example (#608 ) docs: Fix the path to the run_with_accelerator.py example Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-12-16 15:03:06 +01:00
Nikos Livathinos	19fad9261c	feat: Introduce support for GPU Accelerators (#593 ) * Upgraded Layout Postprocessing, sending old code back to ERZ Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement hierachical cluster layout processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested cluster processing through full pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested clusters through GLM as payload Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI. - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run. - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting. - Refactor the way how the docling-ibm-models are called to match the new init signature of models. - Translate the accelerator options to the specific inputs for third-party models. - Extend the docling CLI with parameters to set the num_threads and device. - Add new unit tests. - Write new example how to use the accelerator options. * fix: Improve the pydantic objects in the pipeline_options and imports. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Updated test ground-truth Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test ground-truth (again), bugfix for empty layout Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Do proper check to set the device in EasyOCR, RapidOCR. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Rollback changes from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test gt Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unused debug settings Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Review fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Nail the accelerator defaults for MPS Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-12-13 17:45:22 +01:00
Panos Vagenas	d0c9e8e508	docs: update chunking usage docs, minor reorg (#550 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-10 16:03:02 +01:00
Panos Vagenas	c8ecdd987e	feat: expose new hybrid chunker, update docs (#384 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-09 08:28:29 +01:00
Panos Vagenas	e780333440	docs: document new integrations (#532 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-06 13:18:14 +01:00
Peter W. J. Staar	0d11e30dd8	fix: Enable HTML export in CLI and add options for image mode (#513 ) * updated README Signed-off-by: Peter Staar <taa@zurich.ibm.com> * removed duck in title Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the index.md Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the cli to export html Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added html to cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * removed the duck emoji, added the in the cli. Currently, the referenced seems broken Signed-off-by: Peter Staar <taa@zurich.ibm.com> * cleaning up the comments Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reference is now working Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Clean up styling and docs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin docling-core>=2.7.1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-06 12:37:57 +01:00
Michele Dolfi	d4872103b8	docs: add automatic api reference (#475 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-12-02 09:55:52 +01:00
Michele Dolfi	8ccb3c6db6	docs: introduce faq section (#468 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-29 22:34:56 +01:00
Panos Vagenas	84c46fdeb3	docs: extend integration docs & README (#456 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-28 09:41:21 +01:00
Panos Vagenas	7a45b92078	docs: add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-21 17:23:04 +01:00
Michele Dolfi	ca8524ecae	docs: add automatic generation of CLI reference (#325 ) * docs: add automatic generation of CLI reference Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * install deps for building CLI ref Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:17 +01:00
Panos Vagenas	25fd149c38	docs: add architecture outline (#341 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-15 12:52:41 +01:00
Anush	7f5d35ea3c	docs: Hybrid RAG with Qdrant (#312 ) Signed-off-by: Anush008 <anushshetty90@gmail.com>	2024-11-12 15:18:14 +01:00
Panos Vagenas	93fc1be61a	docs: add Data Prep Kit integration (#316 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-12 12:21:48 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290 ) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Panos Vagenas	1239ade275	docs: add navigation indices (#305 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-11 14:49:06 +01:00
Michele Dolfi	61c092f445	docs: add use docling (#150 ) --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-17 18:14:48 +02:00
Christoph Auer	7d3be0edeb	feat!: Docling v2 (#117 ) --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maxim Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Maxim Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-16 21:02:03 +02:00
Panos Vagenas	d504432c1e	docs: introduce docs site (#141 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-14 14:13:13 +02:00

44 Commits