Michele Dolfi
e6f89d520f
chore: update lock of deps ( #371 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-19 10:23:59 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX ( #349 )
...
* Added picture data for pptx pictures
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added tests for pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Inferring image DPI from pptx file
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-18 15:22:28 +01:00
Michele Dolfi
7dbdbdeaf3
ci: fix mergify ( #350 )
...
* no conv commit message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix mergify rules
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 17:13:01 +01:00
Michele Dolfi
364d37ca96
ci(Mergify): configuration update ( #339 )
...
* ci(Mergify): configuration update
Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove conventionalcommits from the checklist
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <null>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:33 +01:00
Michele Dolfi
ca8524ecae
docs: add automatic generation of CLI reference ( #325 )
...
* docs: add automatic generation of CLI reference
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* install deps for building CLI ref
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-15 13:18:17 +01:00
Panos Vagenas
25fd149c38
docs: add architecture outline ( #341 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-15 12:52:41 +01:00
Carl
835e077b02
docs: fix parameter in usage.md ( #332 )
...
Signed-off-by: Carl Senze <carl.senze@aleph-alpha.com>
Co-authored-by: Carl Senze <carl.senze@aleph-alpha.com>
2024-11-15 09:24:15 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-14 13:33:34 +01:00
Panos Vagenas
bf2a85f1d4
chore: fix Qdrant notebook Colab link ( #319 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-14 10:42:02 +01:00
Michele Dolfi
8b437adcde
fix: reduce logging by keeping option for more verbose ( #323 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 10:08:24 +01:00
github-actions[bot]
5a44236ac2
chore: bump version to 2.5.2 [skip ci]
2024-11-13 08:19:09 +00:00
Michele Dolfi
c9341bf22e
fix: skip glm model downloads ( #322 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-13 08:45:28 +01:00
github-actions[bot]
2c0c439a44
chore: bump version to 2.5.1 [skip ci]
2024-11-12 14:56:34 +00:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* process the content of a single-cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of tricky 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Anush
7f5d35ea3c
docs: Hybrid RAG with Qdrant ( #312 )
...
Signed-off-by: Anush008 <anushshetty90@gmail.com>
2024-11-12 15:18:14 +01:00
Panos Vagenas
93fc1be61a
docs: add Data Prep Kit integration ( #316 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-12 12:21:48 +01:00
github-actions[bot]
777237ebc9
chore: bump version to 2.5.0 [skip ci]
2024-11-12 10:19:55 +00:00
Christoph Auer
5d4a10b121
fix: Configure env prefix for docling settings ( #315 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-12 10:57:16 +01:00
Nikos Livathinos
c6b3763ecb
feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning ( #290 )
...
- When the OCR is forced, any existing PDF cells are rejected.
- Introduce the force-ocr cmd parameter in docling CLI.
- Update unit tests.
- Add the full_page_ocr.py example in mkdocs.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-12 09:46:14 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-11 15:00:11 +01:00
Panos Vagenas
1239ade275
docs: add navigation indices ( #305 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-11 14:49:06 +01:00
Michele Dolfi
97f214efdd
fix: allow mps usage for easyocr ( #286 )
...
* fix: allow mps usage for easyocr
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add example for cpu-only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* comment out example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-10 14:26:17 +01:00
github-actions[bot]
be8aa17291
chore: bump version to 2.4.2 [skip ci]
2024-11-08 16:31:47 +00:00
Nikos Livathinos
0eb065e9b6
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ( #282 )
...
fix(EasyOcrModel): Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr without GPU if MPS is available.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 16:48:41 +01:00
github-actions[bot]
118f162e64
chore: bump version to 2.4.1 [skip ci]
2024-11-08 12:37:36 +00:00
Nikos Livathinos
704d792a79
fix(tesserocr): Raise Exception if tesserocr has not loaded any languages ( #279 )
...
fix(TesseractOcrModel): Raise Exception if tesserocr has not loaded any languages. Provide a descriptive error message.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-11-08 13:03:09 +01:00
Panos Vagenas
6c22cba0a7
chore: add issue templates ( #251 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 23:18:20 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-11-05 16:20:04 +01:00
Panos Vagenas
a84ec276b0
docs: update badges & credits ( #248 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00
Anthony R
90836db90a
fix: Dockerfile example copy command ( #234 )
...
Signed-off-by: Anthony R <anthonyringoet@gmail.com>
2024-11-05 12:48:27 +01:00
Panos Vagenas
5ce02c5c59
docs: add coming-soon section ( #235 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:53:02 +01:00
Panos Vagenas
d5e65aedac
docs: add artifacts-path param to CLI ( #233 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:51:21 +01:00
github-actions[bot]
e30a9c25a2
chore: bump version to 2.4.0 [skip ci]
2024-11-04 15:11:09 +00:00
Panos Vagenas
862d78d271
chore: update pyproject.toml metadata ( #229 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 15:48:00 +01:00
Panos Vagenas
eeee3b4371
docs: add explicit artifacts path example ( #224 )
...
* docs: add explicit artifacts path example
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* minor docs fix
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* touch to trigger needed checks
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 14:27:56 +01:00
Michele Dolfi
5f5fea90a9
docs: update custom convert and dockerfile ( #226 )
...
* docs: remove old code from custom_convert.py
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* docs: update example Dockerfile
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:27:40 +01:00
Vicky Sekhon
41acaa9e2e
docs: correct spelling of 'individual' ( #219 )
...
Signed-off-by: Vicky Sekhon <114193273+VickySekhon@users.noreply.github.com>
2024-11-04 14:27:02 +01:00
Michele Dolfi
40ad987303
feat: pdf backend, table mode as options and artifacts path ( #203 )
...
* feat: add more options in the CLI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update CLI docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* expose artifacts-path as argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-04 14:26:05 +01:00
Johnny Salazar
af323c04ef
fix: Specify encoding when writing output file ( #214 )
...
Specify the encoding when writing the output file to avoid errors when the default target encoding does not cover all characters. UTF-8 seems like the most universal and widely supported encoding. Otherwise, the CLI fails with encoding errors when the input file contains Unicode text (basically most files nowadays) and the target system's default encoding is a single-byte charset such as cp1252.
Signed-off-by: Johnny Salazar <cepera.ang@gmail.com>
2024-11-04 14:24:13 +01:00
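A minimal sketch of the idea behind the encoding fix above, not the actual change made in #214: pass the encoding explicitly when writing text, so the result does not depend on the platform default. The helper name `write_text_output`, the sample string, and the temporary path are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def write_text_output(path: Path, text: str) -> None:
    """Write text with an explicit encoding instead of the platform default."""
    # Without encoding=..., open() uses the locale's preferred encoding
    # (for example cp1252 on some Windows setups), which can raise
    # UnicodeEncodeError for characters outside that charset.
    with path.open("w", encoding="utf-8") as fp:
        fp.write(text)


# Hypothetical sample containing characters that cp1252 cannot represent.
sample = "Zürich 東京 ✓"

out_path = Path(tempfile.mkdtemp()) / "output.md"
write_text_output(out_path, sample)
round_trip = out_path.read_text(encoding="utf-8")

# Show that a single-byte charset would reject the same text.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
```

Reading the file back with the same explicit encoding keeps the round trip independent of the system locale.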
Panos Vagenas
8fb445f46c
chore: make tests lighter ( #228 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-04 14:02:28 +01:00
Panos Vagenas
244ca69cfd
docs: update LlamaIndex docs ( #196 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-01 20:55:28 +01:00
github-actions[bot]
9d8865856d
chore: bump version to 2.3.1 [skip ci]
2024-10-30 18:23:53 +00:00
Michele Dolfi
eb679ccbb4
fix: simplify torch dependencies and update pinned docling deps ( #190 )
...
* fix: simplify torch dependencies and update pinned docling deps
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docling-ibm-models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-30 18:44:08 +01:00
Michele Dolfi
904d24d600
fix: allow to explicitly initialize the pipeline ( #189 )
...
* feat: allow to explicitly initialize the pipeline
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* clean examples
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-30 17:54:53 +01:00
github-actions[bot]
43349865d0
chore: bump version to 2.3.0 [skip ci]
2024-10-30 14:47:37 +00:00
Christoph Auer
2a2c65bf4f
feat: Add pipeline timings and toggle visualization, establish debug settings ( #183 )
...
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Refactor and fix profiling codes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Visualization codes output PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-30 15:04:19 +01:00
Peter W. J. Staar
94a5290789
chore: update the with input formats and DoclingDocument ( #188 )
...
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-30 15:02:28 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00
github-actions[bot]
dda2645d4c
chore: bump version to 2.2.1 [skip ci]
2024-10-28 17:18:41 +00:00