Docling

Author	SHA1	Message	Date
Farzad Sunavala	b885b2fa3c	docs: added markdown headings to enable TOC in github pages (#808 ) * docs: added markdown headings to enable TOC in github pages Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> * minor renames Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> * part 3 heading Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com> --------- Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>	2025-01-27 09:40:35 +01:00
Cesar Berrospi Ramis	c2ae1cc4ca	docs: description of supported formats and backends (#788 ) * chore: remove type-ignore marks for attaching text to non GroupItems After commit b74208 of docling-core, text items can be attached to any NodeItem and therefore the ignore[arg-type] type marks can be removed. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: remove unnecessary imports Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add documentation on supported formats and backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add notebook example with XML backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-26 08:10:33 +01:00
Nikos Livathinos	3be2fb581f	feat: Introduce automatic language detection in TesseractOcrCliModel (#800 ) * feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * docs: Add example how to use "auto" language with tesseract OCR engines Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected language is installed in the system and if not fall back to a default option without language. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-01-26 08:07:56 +01:00
github-actions[bot]	9e4ca90db1	chore: bump version to 2.16.0 [skip ci]	2025-01-24 18:21:14 +00:00
Peter W. J. Staar	a458e298ca	fix: added extraction of byte-images in excel (#804 ) * fix(msexcel): ignore Mypy checking for _find_images_in_sheet function Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> * fixed some issues Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * pinned pillow in pyproject Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-24 18:48:02 +01:00
Matteo	16a218d871	feat: New document picture classifier (#805 ) * figure classifier Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * gt for e2e tests Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * tests Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> --------- Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>	2025-01-24 18:05:51 +01:00
Panos Vagenas	88a0e66adc	feat: add Docling JSON ingestion (#783 ) * feat: add Docling JSON ingestion Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update docling/backend/json/docling_json_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-24 18:05:23 +01:00
Yusik Kim	e9768ae6a5	chore: expose draw_clusters function (#803 ) feat: expose draw_clusters function add type annotations to function signature Signed-off-by: Yusik Kim <kmyusk@gmail.com>	2025-01-24 17:35:29 +01:00
Matteo	3213b247ad	feat: Code and equation model for PDF and code blocks in markdown (#752 ) * propagated changes for new CodeItem class Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Rebased branch on latest main. changes for CodeItem Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused files Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * chore: update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * pin latest docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docling-core pinning Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use new add_code in backends and update typing in MD backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * added if statement for backend Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused import Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed print statements Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * gt for new pdf Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Update docling/pipeline/standard_pdf_pipeline.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> * fixed doc comment of __call__ function of code_formula_model Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * fix artifacts_path type Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move expansion_factor to base class Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> 
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-24 16:54:22 +01:00
Farzad Sunavala	c58f75d0f7	docs: fix minor typos (#801 ) Signed-off-by: Farzad Sunavala <40604067+farzad528@users.noreply.github.com>	2025-01-24 16:27:05 +01:00
Farzad Sunavala	9020a934be	docs: add Azure RAG example (#675 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Farzad Sunavala <fsunavala@microsoft.com>	2025-01-24 13:56:26 +01:00
Pavel Denisov	8543c22687	feat: add "auto" language for TesseractOcr (#759 ) * Add "auto" language for TesseractOcr Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Add tesseract-ocr-script-latn installation for the "auto" language Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Modify "auto" language in TesseractOcr to initialize the script readers lazily Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Finalize script readers Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> * Fix script models prefix for Linux Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de> --------- Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>	2025-01-23 12:40:50 +01:00
Michele Dolfi	c49b3526fb	docs: fix links between docs pages (#697 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-20 09:52:59 +01:00
Selvam Palanimalai	e4c7210133	ci: added action to generate llms.txt (#701 ) * ci: added action in docs.yml to generate llms.txt Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com> * ci: pinning llms-txt action version as per PR feedback Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com> --------- Signed-off-by: Selvam Palanimalai <selvam.palanimalai@gmail.com>	2025-01-20 09:52:27 +01:00
Christoph Auer	670a08bded	fix: Update docling-parse-v2 backend version with new parsing fixes (#769 ) * chore: Update lockfile with docling-parse git branch Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Final docling-parse pinning Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-20 09:00:57 +01:00
Iacopo Ghinassi	768608351d	docs: fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733 ) * Update custom_convert.py Added the missing AcceleratorDevice and AcceleratorOptions functions in the imports and changed Device in the code to the correct AcceleratorDevice Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com> * apply formatting Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Iacopo Ghinassi <45108036+Ighina@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-19 16:55:26 +01:00
Michele Dolfi	57fc28d3d8	refactor: allow the usage of backends in the enrich models and generalize the interface (#742 ) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 09:52:38 +01:00
Peter W. J. Staar	f7e1cbf629	docs: Example to translate documents (#739 ) * added example to translate documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the mkdocs Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fix PR hooks Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 06:51:15 +01:00
github-actions[bot]	1976584be1	chore: bump version to 2.15.1 [skip ci]	2025-01-10 10:29:32 +00:00
Christoph Auer	5a060f237d	fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719 ) fix: Properly care for all bitmap elements in OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-10 10:38:49 +01:00
Panos Vagenas	9a6b5c8c8d	docs: add pointers to LangChain-side docs (#718 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-09 17:36:46 +01:00
Panos Vagenas	4fa8028bd8	docs: add LangChain docs (#717 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-09 14:12:05 +01:00
Michele Dolfi	e64b5a2f62	fix: allow earlier requests versions (#716 ) allow earlier requests versions Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-09 13:30:40 +01:00
github-actions[bot]	9a94b54f6c	chore: bump version to 2.15.0 [skip ci]	2025-01-08 12:06:38 +00:00
Christoph Auer	5cb4cf6f19	fix: Correct scaling of debug visualizations, tune OCR (#700 ) * fix: Correct scaling of debug visualizations, tune OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: remove unused imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: Update docling-core Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-08 12:26:44 +01:00
Michele Dolfi	ead396ab40	docs: specify docstring types (#702 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-08 09:05:18 +01:00
Michele Dolfi	6701f34c85	docs: add link to rag with granite (#698 ) * docs: add link to rag with granite Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update mkdocs.yml Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 20:01:41 +01:00
Christoph Auer	42856fdf79	fix: Let BeautifulSoup detect the HTML encoding (#695 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-07 15:49:28 +01:00
Panos Vagenas	2d24faecd9	docs: add integrations, revamp docs (#693 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-07 14:15:54 +01:00
Jinfeng Sun	d49650c54f	fix(mspowerpoint): handle invalid images in PowerPoint slides (#650 ) - Add error handling for images that cannot be loaded by Pillow - Improve resilience when encountering corrupted or unsupported image formats - Maintain processing of other slide elements even if an image fails to load Signed-off-by: Tendo33 <sjf1998112@gmail.com>	2025-01-07 13:58:10 +01:00
Luke Harrison	0ee849e8bc	feat: added http header support for document converter and cli (#642 ) * added http header support for document converter and cli Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * fixed formatting and typing issues Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * use pydantic to parse dict suggested by @dolfim-ibm Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> --------- Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-07 10:15:14 +01:00
JSIV	569038df42	docs: Add OpenContracts as an integration (#679 ) * Add OpenContracts as an open source project OpenContracts now offers Docling as a document ingestion and parsing pipeline Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> * Update mkdocs.yml Added OpenContracts to the nav configs Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com> --------- Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>	2025-01-07 10:14:42 +01:00
m-newhauser	2b591f9872	docs: add Weaviate RAG recipe notebook (#451 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 21:57:40 +01:00
Panos Vagenas	fc645ea531	docs: document Haystack & Vectara support (#628 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-19 13:33:02 +01:00
github-actions[bot]	1418fa1488	chore: bump version to 2.14.0 [skip ci]	2024-12-18 07:04:47 +00:00
Lucas Morin	fd034802b6	feat: Create a backend to transform PubMed XML files to DoclingDocument (#557 ) Signed-off-by: lucas-morin <lucas.morin222@gmail.com>	2024-12-17 19:27:09 +01:00
github-actions[bot]	e31f09f71f	chore: bump version to 2.13.0 [skip ci]	2024-12-17 17:01:04 +00:00
Christoph Auer	60dc852f16	feat: Updated Layout processing with forms and key-value areas (#530 ) * Upgraded Layout Postprocessing, sending old code back to ERZ Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement hierachical cluster layout processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested cluster processing through full pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested clusters through GLM as payload Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI. - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run. - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting. - Refactor the way how the docling-ibm-models are called to match the new init signature of models. - Translate the accelerator options to the specific inputs for third-party models. - Extend the docling CLI with parameters to set the num_threads and device. - Add new unit tests. - Write new example how to use the accelerator options. * fix: Improve the pydantic objects in the pipeline_options and imports. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Updated test ground-truth Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test ground-truth (again), bugfix for empty layout Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Do proper check to set the device in EasyOCR, RapidOCR. 
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Correct the way to set GPU for EasyOCR, RapidOCR Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Ocr AccleratorDevice Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Merge pull request #556 from DS4SD/cau/layout-processing-improvement feat: layout processing improvements and bugfixes * Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update HF model ref, reset test generate Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Repin to release package versions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Many layout processing improvements, add document index type Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update pinnings to docling-core Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix table box snapping Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for cluster pre-ordering Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Introduce OCR confidence, propagate to orphan in post-processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix form and key value area groups Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Adjust confidence in EasyOcr Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Roll back CLI changes from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update docling-core pinning Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Annoying fixes for historical python versions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test GT for legacy Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Comment cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> 
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-12-17 17:32:24 +01:00
Cesar Berrospi Ramis	00dec7a2f3	test: generate file from CLI in a temporary directory (#618 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2024-12-17 16:35:42 +01:00
Cesar Berrospi Ramis	4e087504cc	feat: create a backend to parse USPTO patents into DoclingDocument (#606 ) * feat: add PATENT_USPTO as input format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * feat: add USPTO backend parser Add a backend implementation to parse patent applications and grants from the United States Patent Office (USPTO). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: change the name of the USPTO input format Change the name of the patent USPTO input format to show the typical format (XML). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: address several input formats with same mime type Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: group XML backend parsers in a subfolder Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add safe initialization of PatentUsptoDocumentBackend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2024-12-17 16:35:23 +01:00
Panos Vagenas	3e599c7bbe	docs: add Haystack RAG example (#615 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-17 14:24:40 +01:00
itsainii	3b53bd38c8	feat: Add Easyocr parameter recog_network (#613 ) * Update easyocr_model.py Added this line of code to get recog_network of easyocr parameter recog_network = self.options.recog_network Signed-off-by: itsainii <aininawawii@gmail.com> * Update pipeline_options.py Added this line in EasyOcrOptions function recog_network: Optional[str] = 'standard' Signed-off-by: itsainii <aininawawii@gmail.com> * Add Easyocr recog_network parameter Signed-off-by: itsainii <aininawawii@gmail.com> --------- Signed-off-by: itsainii <aininawawii@gmail.com>	2024-12-17 09:47:18 +01:00
Nikos Livathinos	3bb3bf5715	docs: Fix the path to the run_with_accelerator.py example (#608 ) docs: Fix the path to the run_with_accelerator.py example Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-12-16 15:03:06 +01:00
github-actions[bot]	a2db5fbd0f	chore: bump version to 2.12.0 [skip ci]	2024-12-13 18:27:00 +00:00
Nikos Livathinos	19fad9261c	feat: Introduce support for GPU Accelerators (#593 ) * Upgraded Layout Postprocessing, sending old code back to ERZ Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement hierachical cluster layout processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested cluster processing through full pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested clusters through GLM as payload Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI. - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run. - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting. - Refactor the way how the docling-ibm-models are called to match the new init signature of models. - Translate the accelerator options to the specific inputs for third-party models. - Extend the docling CLI with parameters to set the num_threads and device. - Add new unit tests. - Write new example how to use the accelerator options. * fix: Improve the pydantic objects in the pipeline_options and imports. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Updated test ground-truth Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test ground-truth (again), bugfix for empty layout Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Do proper check to set the device in EasyOCR, RapidOCR. 
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Rollback changes from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test gt Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unused debug settings Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Review fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Nail the accelerator defaults for MPS Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-12-13 17:45:22 +01:00
github-actions[bot]	365a1e7b98	chore: bump version to 2.11.0 [skip ci]	2024-12-12 08:16:05 +00:00
Abhishek Kumar	3da166eafa	feat: Add timeout limit to document parsing job. DS4SD#270 (#552 ) Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com> Testing: (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec. WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert. INFO:docling.cli.main:Processed 1 docs, of which 1 failed INFO:docling.cli.main:All documents were converted in 36.29 seconds. (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 58.56 seconds. (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 59.88 seconds. 
(.venv) mario@Abhisheks-MacBook-Air docling % docling Usage: docling [OPTIONS] source ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ --from [docx\|pptx\|html\|image\|pdf\|asciido Specify input formats to convert │ │ c\|md\|xlsx] from. Defaults to all formats. │ │ [default: None] │ │ --to [md\|json\|html\|text\|doctags] Specify output formats. Defaults to │ │ Markdown. │ │ [default: None] │ │ --image-export-mode [placeholder\|embedded\|referenced] Image export mode for the document │ │ (only in case of JSON, Markdown or │ │ HTML). With `placeholder`, only the │ │ position of the image is marked in │ │ the output. In `embedded` mode, the │ │ image is embedded as base64 encoded │ │ string. In `referenced` mode, the │ │ image is exported in PNG format and │ │ referenced from the main exported │ │ document. │ │ [default: embedded] │ │ --ocr --no-ocr If enabled, the bitmap content will │ │ be processed using OCR. │ │ [default: ocr] │ │ --force-ocr --no-force-ocr Replace any existing text with OCR │ │ generated text over the full │ │ content. │ │ [default: no-force-ocr] │ │ --ocr-engine [easyocr\|tesseract_cli\|tesseract\| The OCR engine to use. │ │ ocrmac\|rapidocr] [default: easyocr] │ │ --ocr-lang TEXT Provide a comma-separated list of │ │ languages used by the OCR engine. │ │ Note that each OCR engine has │ │ different values for the language │ │ names. │ │ [default: None] │ │ --pdf-backend [pypdfium2\|dlparse_v1\|dlparse_v2] The PDF backend to use. 
│ │ [default: dlparse_v2] │ │ --table-mode [fast\|accurate] The mode to use in the table │ │ structure model. │ │ [default: fast] │ │ --artifacts-path PATH If provided, the location of the │ │ model artifacts. │ │ [default: None] │ │ --abort-on-error --no-abort-on-error If enabled, the bitmap content will │ │ be processed using OCR. │ │ [default: no-abort-on-error] │ │ --output PATH Output directory where results are │ │ saved. │ │ [default: .] │ │ --verbose -v INTEGER Set the verbosity level. -v for │ │ info logging, -vv for debug │ │ logging. │ │ [default: 0] │ │ --debug-visualize-cells --no-debug-visualize-cells Enable debug output which │ │ visualizes the PDF cells │ │ [default: no-debug-visualize-cells] │ │ --debug-visualize-ocr --no-debug-visualize-ocr Enable debug output which │ │ visualizes the OCR cells │ │ [default: no-debug-visualize-ocr] │ │ --debug-visualize-layout --no-debug-visualize-layout Enable debug output which │ │ visualizes the layour clusters │ │ [default: │ │ no-debug-visualize-layout] │ │ --debug-visualize-tables --no-debug-visualize-tables Enable debug output which │ │ visualizes the table cells │ │ [default: │ │ no-debug-visualize-tables] │ │ --version Show version information. │ │ --document-timeout FLOAT The timeout for processing each │ │ document, in seconds. │ │ [default: None] │ │ --help Show this message and exit. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯	2024-12-11 15:06:10 +01:00
Christoph Auer	aee9c0b324	fix: Do not import python modules from deepsearch-glm (#569 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-11 12:29:06 +01:00
Christoph Auer	f45499ce93	fix: Handle no result from RapidOcr reader (#558 ) Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-12-10 16:25:05 +01:00
Panos Vagenas	d0c9e8e508	docs: update chunking usage docs, minor reorg (#550 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-10 16:03:02 +01:00

408 Commits