Docling

Author	SHA1	Message	Date
Peter W. J. Staar	926dfd29d5	feat: added excel backend (#334 ) * feat: added excel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first msexcel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tooling for the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working version for excel parsing of tables Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactor EXCEL to XLSX Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the unit tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran poetry lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * adding images to output [WIP] Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tests for merged cells in excel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2024-11-19 12:21:17 +01:00
Michele Dolfi	e6f89d520f	chore: update lock of deps (#371 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-19 10:23:59 +01:00
Maxim Lysak	7a97d7119f	feat: Extracting picture data for raster images found in PPTX (#349 ) * Added picture data for pptx pictures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tests for pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Inferring image DPI from pptx file Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-18 15:22:28 +01:00
Michele Dolfi	7dbdbdeaf3	ci: fix mergify (#350 ) * no conv commit message Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix mergify rules Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 17:13:01 +01:00
Michele Dolfi	364d37ca96	ci(Mergify): configuration update (#339 ) * ci(Mergify): configuration update Signed-off-by: Michele Dolfi <null> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove conventionalcommits from the checklist Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <null> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:33 +01:00
Michele Dolfi	ca8524ecae	docs: add automatic generation of CLI reference (#325 ) * docs: add automatic generation of CLI reference Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * install deps for building CLI ref Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:17 +01:00
Panos Vagenas	25fd149c38	docs: add architecture outline (#341 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-15 12:52:41 +01:00
Carl	835e077b02	docs: fix parameter in usage.md (#332 ) Signed-off-by: Carl Senze <carl.senze@aleph-alpha.com> Co-authored-by: Carl Senze <carl.senze@aleph-alpha.com>	2024-11-15 09:24:15 +01:00
Maxim Lysak	8533039b0c	fix: Fixing images in the input Word files (#330 ) * Fixing images identification in the input Word files Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Populating extracted image data into docling picture for wordx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed base64 dependency in msword_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-14 13:33:34 +01:00
Panos Vagenas	bf2a85f1d4	chore: fix Qdrant notebook Colab link (#319 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-14 10:42:02 +01:00
Michele Dolfi	8b437adcde	fix: reduce logging by keeping option for more verbose (#323 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 10:08:24 +01:00
github-actions[bot]	5a44236ac2	chore: bump version to 2.5.2 [skip ci]	2024-11-13 08:19:09 +00:00
Michele Dolfi	c9341bf22e	fix: skip glm model downloads (#322 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 08:45:28 +01:00
github-actions[bot]	2c0c439a44	chore: bump version to 2.5.1 [skip ci]	2024-11-12 14:56:34 +00:00
Maxim Lysak	fb8ba861e2	fix: Handling of single-cell tables in DOCX backend (#314 ) * Handling of single-cell tables in DOCX backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * returned try-catch on tables handling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * proceed processing the content of single cell table as if its just part of the body Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example of trickly 1 cell table docx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-12 15:20:55 +01:00
Anush	7f5d35ea3c	docs: Hybrid RAG with Qdrant (#312 ) Signed-off-by: Anush008 <anushshetty90@gmail.com>	2024-11-12 15:18:14 +01:00
Panos Vagenas	93fc1be61a	docs: add Data Prep Kit integration (#316 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-12 12:21:48 +01:00
github-actions[bot]	777237ebc9	chore: bump version to 2.5.0 [skip ci]	2024-11-12 10:19:55 +00:00
Christoph Auer	5d4a10b121	fix: Configure env prefix for docling settings (#315 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-11-12 10:57:16 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290 ) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Maxim Lysak	81c8243a8b	fix: Added handling of grouped elements in pptx backend (#307 ) * Added handling of grouped elements in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * updated log.warn to warning Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 16:38:21 +01:00
Maxim Lysak	53bf2d1790	Added handling of code blocks in html with <pre> tag (#302 ) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 15:00:11 +01:00
Panos Vagenas	1239ade275	docs: add navigation indices (#305 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-11 14:49:06 +01:00
Michele Dolfi	97f214efdd	fix: allow mps usage for easyocr (#286 ) * fix: allow mps usage for easyocr Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add example for cpu-only Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * comment out example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-10 14:26:17 +01:00
github-actions[bot]	be8aa17291	chore: bump version to 2.4.2 [skip ci]	2024-11-08 16:31:47 +00:00
Nikos Livathinos	0eb065e9b6	fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282 ) fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-08 16:48:41 +01:00
github-actions[bot]	118f162e64	chore: bump version to 2.4.1 [skip ci]	2024-11-08 12:37:36 +00:00
Nikos Livathinos	704d792a79	fix(tesserocr): Raise Exception if tesserocr has not loaded any languages (#279 ) fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-08 13:03:09 +01:00
Panos Vagenas	6c22cba0a7	chore: add issue templates (#251 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine	c3098e3c12	chore: fix typo (#241 ) * chore: update pypdfium2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_v2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> --------- Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>	2024-11-05 16:20:04 +01:00
Panos Vagenas	a84ec276b0	docs: update badges & credits (#248 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 13:57:06 +01:00
Anthony R	90836db90a	fix: Dockerfile example copy command (#234 ) Signed-off-by: Anthony R <anthonyringoet@gmail.com>	2024-11-05 12:48:27 +01:00
Panos Vagenas	5ce02c5c59	docs: add coming-soon section (#235 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 08:53:02 +01:00
Panos Vagenas	d5e65aedac	docs: add artifacts-path param to CLI (#233 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 08:51:21 +01:00
github-actions[bot]	e30a9c25a2	chore: bump version to 2.4.0 [skip ci]	2024-11-04 15:11:09 +00:00
Panos Vagenas	862d78d271	chore: update pyproject.toml metadata (#229 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 15:48:00 +01:00
Panos Vagenas	eeee3b4371	docs: add explicit artifacts path example (#224 ) * docs: add explicit artifacts path example [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * minor docs fix [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * touch to trigger needed checks Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 14:27:56 +01:00
Michele Dolfi	5f5fea90a9	docs: update custom convert and dockerfile (#226 ) * docs: remove old code from custom_convert.py Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * docs: update example Dockerfile Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-04 14:27:40 +01:00
Vicky Sekhon	41acaa9e2e	docs: correct spelling of 'individual' (#219 ) Signed-off-by: Vicky Sekhon <114193273+VickySekhon@users.noreply.github.com>	2024-11-04 14:27:02 +01:00
Michele Dolfi	40ad987303	feat: pdf backend, table mode as options and artifacts path (#203 ) * feat: add more options in the CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update CLI docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * expose artifacts-path as argument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-04 14:26:05 +01:00
Johnny Salazar	af323c04ef	fit: Specify encoding when writing output file (#214 ) Specify encoding when writing output file to avoid errors when default target encoding doesn't have all characters. utf8 seems like the most universal and supported encoding. Otherwise, the cli fails with encoding errors when input file contains unicode text (basically most files nowadays) and the target system has default encoding set to some one-byte charset like cp1252 Signed-off-by: Johnny Salazar <cepera.ang@gmail.com>	2024-11-04 14:24:13 +01:00
Panos Vagenas	8fb445f46c	chore: make tests lighter (#228 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 14:02:28 +01:00
Panos Vagenas	244ca69cfd	docs: update LlamaIndex docs (#196 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-01 20:55:28 +01:00
github-actions[bot]	9d8865856d	chore: bump version to 2.3.1 [skip ci]	2024-10-30 18:23:53 +00:00
Michele Dolfi	eb679ccbb4	fix: simplify torch dependencies and update pinned docling deps (#190 ) * fix: simplify torch dependencies and update pinned docling deps Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docling-ibm-models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-10-30 18:44:08 +01:00
Michele Dolfi	904d24d600	fix: allow to explicitly initialize the pipeline (#189 ) * feat: allow to explicitly initialize the pipeline Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * clean examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-10-30 17:54:53 +01:00
github-actions[bot]	43349865d0	chore: bump version to 2.3.0 [skip ci]	2024-10-30 14:47:37 +00:00
Christoph Auer	2a2c65bf4f	feat: Add pipeline timings and toggle visualization, establish debug settings (#183 ) * Add settings to turn visualization on or off Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add profiling code to all models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Refactor and fix profiling codes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Visualization codes output PNG to debug dir Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for time logging Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Optimize imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add start_timestamps to ProfilingItem Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-10-30 15:04:19 +01:00
Peter W. J. Staar	94a5290789	chore: update the with input formats and DoclingDocument (#188 ) --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2024-10-30 15:02:28 +01:00
Peter W. J. Staar	f542460af3	fix: fix duplicate title and heading + add e2e tests for html and docx (#186 ) * add real e2e tests for html and docx Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the output of itxt Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the text Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the examples (1) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the output of the test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the tests, moved the ground-truth Signed-off-by: Peter Staar <taa@zurich.ibm.com> * moved the ground-truth data Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the html tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * restructure title fix (#187) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-30 13:14:56 +01:00

209 Commits