Docling

Author	SHA1	Message	Date
nuridol	6efa96c983	feat: add support for `ocrmac` OCR engine on macOS (#276 ) * feat: add support for `ocrmac` OCR engine on macOS - Integrates `ocrmac` as an OCR engine option for macOS users. - Adds configuration options and dependencies for `ocrmac`. - Updates documentation to reflect new engine support. This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * updated the poetry lock Signed-off-by: Suhwan Seo <nuridol@gmail.com> * Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems - Resolved formatting and linting issues - Updated `--ocr-engine` CLI option documentation for `ocrmac` - Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms Signed-off-by: Suhwan Seo <nuridol@gmail.com> * feat: add support for `ocrmac` OCR engine on macOS - Integrates `ocrmac` as an OCR engine option for macOS users. - Adds configuration options and dependencies for `ocrmac`. - Updates documentation to reflect new engine support. This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * docs: update examples and installation for ocrmac support - Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples. - Included usage comments and examples for `OcrMacOptions` in OCR pipelines. - Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+). - Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend. This enhances documentation for users working on macOS to leverage `ocrmac` effectively. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * fix: update `ocrmac` dependency with macOS-specific marker - Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility. - Updated the content hash in `poetry.lock` to reflect the changes. This ensures the `ocrmac` dependency is only installed on macOS systems. Signed-off-by: Suhwan Seo <nuridol@gmail.com> --------- Signed-off-by: Suhwan Seo <nuridol@gmail.com> Co-authored-by: Suhwan Seo <nuridol@gmail.com>	2024-11-20 12:51:19 +01:00
Michele Dolfi	32ebf55e33	fix: propagate document limits to converter (#388 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-20 08:36:51 +01:00
github-actions[bot]	2cfaceb787	chore: bump version to 2.6.0 [skip ci]	2024-11-19 16:07:34 +00:00
Shubham Gupta	3f91e7d3f1	feat: added support for exporting DocItem to an image when page image is available (#379 ) * Updated minimum docling-core version to 2.4.0 Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> * Deprecated the generate_table_images option Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> * Updated examples to use get_image instead of element.image Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> --------- Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>	2024-11-19 16:28:52 +01:00
Gaspard Petit	911c3bda27	docs: fixed typo in v2 example v2 (#378 ) Update v2.md - fixed typo in example: iterate_items -> iterate_items() Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>	2024-11-19 16:27:19 +01:00
Michele Dolfi	ed785ea122	feat: expose ocr-lang in CLI (#375 ) * feat: expose ocr-lang in CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use regex for supporting multiple sep Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-19 15:58:49 +01:00
Peter W. J. Staar	926dfd29d5	feat: added excel backend (#334 ) * feat: added excel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first msexcel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tooling for the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working version for excel parsing of tables Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactor EXCEL to XLSX Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the unit tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran poetry lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * adding images to output [WIP] Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tests for merged cells in excel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2024-11-19 12:21:17 +01:00
Michele Dolfi	e6f89d520f	chore: update lock of deps (#371 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-19 10:23:59 +01:00
Maxim Lysak	7a97d7119f	feat: Extracting picture data for raster images found in PPTX (#349 ) * Added picture data for pptx pictures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tests for pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Inferring image DPI from pptx file Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-18 15:22:28 +01:00
Michele Dolfi	7dbdbdeaf3	ci: fix mergify (#350 ) * no conv commit message Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix mergify rules Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 17:13:01 +01:00
Michele Dolfi	364d37ca96	ci(Mergify): configuration update (#339 ) * ci(Mergify): configuration update Signed-off-by: Michele Dolfi <null> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove conventionalcommits from the checklist Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <null> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:33 +01:00
Michele Dolfi	ca8524ecae	docs: add automatic generation of CLI reference (#325 ) * docs: add automatic generation of CLI reference Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * install deps for building CLI ref Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:17 +01:00
Panos Vagenas	25fd149c38	docs: add architecture outline (#341 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-15 12:52:41 +01:00
Carl	835e077b02	docs: fix parameter in usage.md (#332 ) Signed-off-by: Carl Senze <carl.senze@aleph-alpha.com> Co-authored-by: Carl Senze <carl.senze@aleph-alpha.com>	2024-11-15 09:24:15 +01:00
Maxim Lysak	8533039b0c	fix: Fixing images in the input Word files (#330 ) * Fixing images identification in the input Word files Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Populating extracted image data into docling picture for wordx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed base64 dependency in msword_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-14 13:33:34 +01:00
Panos Vagenas	bf2a85f1d4	chore: fix Qdrant notebook Colab link (#319 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-14 10:42:02 +01:00
Michele Dolfi	8b437adcde	fix: reduce logging by keeping option for more verbose (#323 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 10:08:24 +01:00
github-actions[bot]	5a44236ac2	chore: bump version to 2.5.2 [skip ci]	2024-11-13 08:19:09 +00:00
Michele Dolfi	c9341bf22e	fix: skip glm model downloads (#322 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 08:45:28 +01:00
github-actions[bot]	2c0c439a44	chore: bump version to 2.5.1 [skip ci]	2024-11-12 14:56:34 +00:00
Maxim Lysak	fb8ba861e2	fix: Handling of single-cell tables in DOCX backend (#314 ) * Handling of single-cell tables in DOCX backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * returned try-catch on tables handling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * proceed processing the content of single cell table as if its just part of the body Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example of trickly 1 cell table docx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-12 15:20:55 +01:00
Anush	7f5d35ea3c	docs: Hybrid RAG with Qdrant (#312 ) Signed-off-by: Anush008 <anushshetty90@gmail.com>	2024-11-12 15:18:14 +01:00
Panos Vagenas	93fc1be61a	docs: add Data Prep Kit integration (#316 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-12 12:21:48 +01:00
github-actions[bot]	777237ebc9	chore: bump version to 2.5.0 [skip ci]	2024-11-12 10:19:55 +00:00
Christoph Auer	5d4a10b121	fix: Configure env prefix for docling settings (#315 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-11-12 10:57:16 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290 ) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Maxim Lysak	81c8243a8b	fix: Added handling of grouped elements in pptx backend (#307 ) * Added handling of grouped elements in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * updated log.warn to warning Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 16:38:21 +01:00
Maxim Lysak	53bf2d1790	Added handling of code blocks in html with <pre> tag (#302 ) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 15:00:11 +01:00
Panos Vagenas	1239ade275	docs: add navigation indices (#305 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-11 14:49:06 +01:00
Michele Dolfi	97f214efdd	fix: allow mps usage for easyocr (#286 ) * fix: allow mps usage for easyocr Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add example for cpu-only Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * comment out example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-10 14:26:17 +01:00
github-actions[bot]	be8aa17291	chore: bump version to 2.4.2 [skip ci]	2024-11-08 16:31:47 +00:00
Nikos Livathinos	0eb065e9b6	fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282 ) fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-08 16:48:41 +01:00
github-actions[bot]	118f162e64	chore: bump version to 2.4.1 [skip ci]	2024-11-08 12:37:36 +00:00
Nikos Livathinos	704d792a79	fix(tesserocr): Raise Exception if tesserocr has not loaded any languages (#279 ) fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-08 13:03:09 +01:00
Panos Vagenas	6c22cba0a7	chore: add issue templates (#251 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine	c3098e3c12	chore: fix typo (#241 ) * chore: update pypdfium2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_v2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> --------- Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>	2024-11-05 16:20:04 +01:00
Panos Vagenas	a84ec276b0	docs: update badges & credits (#248 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 13:57:06 +01:00
Anthony R	90836db90a	fix: Dockerfile example copy command (#234 ) Signed-off-by: Anthony R <anthonyringoet@gmail.com>	2024-11-05 12:48:27 +01:00
Panos Vagenas	5ce02c5c59	docs: add coming-soon section (#235 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 08:53:02 +01:00
Panos Vagenas	d5e65aedac	docs: add artifacts-path param to CLI (#233 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-05 08:51:21 +01:00
github-actions[bot]	e30a9c25a2	chore: bump version to 2.4.0 [skip ci]	2024-11-04 15:11:09 +00:00
Panos Vagenas	862d78d271	chore: update pyproject.toml metadata (#229 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 15:48:00 +01:00
Panos Vagenas	eeee3b4371	docs: add explicit artifacts path example (#224 ) * docs: add explicit artifacts path example [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * minor docs fix [skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * touch to trigger needed checks Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 14:27:56 +01:00
Michele Dolfi	5f5fea90a9	docs: update custom convert and dockerfile (#226 ) * docs: remove old code from custom_convert.py Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * docs: update example Dockerfile Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-04 14:27:40 +01:00
Vicky Sekhon	41acaa9e2e	docs: correct spelling of 'individual' (#219 ) Signed-off-by: Vicky Sekhon <114193273+VickySekhon@users.noreply.github.com>	2024-11-04 14:27:02 +01:00
Michele Dolfi	40ad987303	feat: pdf backend, table mode as options and artifacts path (#203 ) * feat: add more options in the CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update CLI docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * expose artifacts-path as argument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-04 14:26:05 +01:00
Johnny Salazar	af323c04ef	fit: Specify encoding when writing output file (#214 ) Specify encoding when writing output file to avoid errors when default target encoding doesn't have all characters. utf8 seems like the most universal and supported encoding. Otherwise, the cli fails with encoding errors when input file contains unicode text (basically most files nowadays) and the target system has default encoding set to some one-byte charset like cp1252 Signed-off-by: Johnny Salazar <cepera.ang@gmail.com>	2024-11-04 14:24:13 +01:00
Panos Vagenas	8fb445f46c	chore: make tests lighter (#228 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-04 14:02:28 +01:00
Panos Vagenas	244ca69cfd	docs: update LlamaIndex docs (#196 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-11-01 20:55:28 +01:00
github-actions[bot]	9d8865856d	chore: bump version to 2.3.1 [skip ci]	2024-10-30 18:23:53 +00:00

415 Commits