415 Commits

Author SHA1 Message Date
github-actions[bot]
a2db5fbd0f chore: bump version to 2.12.0 [skip ci] 2024-12-13 18:27:00 +00:00
Nikos Livathinos
19fad9261c
feat: Introduce support for GPU Accelerators (#593)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement hierarchical cluster layout processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce accelerator_utils with a function to decide the device and resolve the AUTO setting.
- Refactor how the docling-ibm-models are called to match the new init signature of the models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write a new example showing how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Updated test ground-truth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Rollback changes from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test gt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused debug settings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Review fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Nail the accelerator defaults for MPS

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-12-13 17:45:22 +01:00
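Note on the accelerator commit above: the message mentions resolving an AUTO device setting to a concrete device. The following is a minimal sketch of that idea, not the project's actual code. The names `Device` and `resolve_device`, the availability flags, and the CUDA > MPS > CPU priority are all assumptions made for illustration.

```python
from enum import Enum


class Device(str, Enum):
    """Hypothetical stand-in for the device choices named in the commit."""

    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_device(requested: Device, cuda_available: bool, mps_available: bool) -> Device:
    """Resolve AUTO to a concrete device; fall back to CPU when a request cannot be met."""
    if requested is Device.AUTO:
        # Assumed priority order: CUDA first, then MPS, then CPU.
        if cuda_available:
            return Device.CUDA
        if mps_available:
            return Device.MPS
        return Device.CPU
    if requested is Device.CUDA and not cuda_available:
        return Device.CPU
    if requested is Device.MPS and not mps_available:
        return Device.CPU
    return requested
```

In a real setup the two flags would come from the deep-learning framework in use; they are passed in explicitly here so the sketch stays dependency-free.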
github-actions[bot]
365a1e7b98 chore: bump version to 2.11.0 [skip ci] 2024-12-12 08:16:05 +00:00
Abhishek Kumar
3da166eafa
feat: Add timeout limit to document parsing job. DS4SD#270 (#552)
Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com>

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                                       [docx|pptx|html|image|pdf|asciido  Specify input formats to convert    │
│                                                              c|md|xlsx]                         from. Defaults to all formats.      │
│                                                                                                 [default: None]                     │
│ --to                                                         [md|json|html|text|doctags]        Specify output formats. Defaults to │
│                                                                                                 Markdown.                           │
│                                                                                                 [default: None]                     │
│ --image-export-mode                                          [placeholder|embedded|referenced]  Image export mode for the document  │
│                                                                                                 (only in case of JSON, Markdown or  │
│                                                                                                 HTML). With `placeholder`, only the │
│                                                                                                 position of the image is marked in  │
│                                                                                                 the output. In `embedded` mode, the │
│                                                                                                 image is embedded as base64 encoded │
│                                                                                                 string. In `referenced` mode, the   │
│                                                                                                 image is exported in PNG format and │
│                                                                                                 referenced from the main exported   │
│                                                                                                 document.                           │
│                                                                                                 [default: embedded]                 │
│ --ocr                         --no-ocr                                                          If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: ocr]                      │
│ --force-ocr                   --no-force-ocr                                                    Replace any existing text with OCR  │
│                                                                                                 generated text over the full        │
│                                                                                                 content.                            │
│                                                                                                 [default: no-force-ocr]             │
│ --ocr-engine                                                 [easyocr|tesseract_cli|tesseract|  The OCR engine to use.              │
│                                                              ocrmac|rapidocr]                   [default: easyocr]                  │
│ --ocr-lang                                                   TEXT                               Provide a comma-separated list of   │
│                                                                                                 languages used by the OCR engine.   │
│                                                                                                 Note that each OCR engine has       │
│                                                                                                 different values for the language   │
│                                                                                                 names.                              │
│                                                                                                 [default: None]                     │
│ --pdf-backend                                                [pypdfium2|dlparse_v1|dlparse_v2]  The PDF backend to use.             │
│                                                                                                 [default: dlparse_v2]               │
│ --table-mode                                                 [fast|accurate]                    The mode to use in the table        │
│                                                                                                 structure model.                    │
│                                                                                                 [default: fast]                     │
│ --artifacts-path                                             PATH                               If provided, the location of the    │
│                                                                                                 model artifacts.                    │
│                                                                                                 [default: None]                     │
│ --abort-on-error              --no-abort-on-error                                               If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: no-abort-on-error]        │
│ --output                                                     PATH                               Output directory where results are  │
│                                                                                                 saved.                              │
│                                                                                                 [default: .]                        │
│ --verbose                 -v                                 INTEGER                            Set the verbosity level. -v for     │
│                                                                                                 info logging, -vv for debug         │
│                                                                                                 logging.                            │
│                                                                                                 [default: 0]                        │
│ --debug-visualize-cells       --no-debug-visualize-cells                                        Enable debug output which           │
│                                                                                                 visualizes the PDF cells            │
│                                                                                                 [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr         --no-debug-visualize-ocr                                          Enable debug output which           │
│                                                                                                 visualizes the OCR cells            │
│                                                                                                 [default: no-debug-visualize-ocr]   │
│ --debug-visualize-layout      --no-debug-visualize-layout                                       Enable debug output which           │
│                                                                                                 visualizes the layout clusters      │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-layout]          │
│ --debug-visualize-tables      --no-debug-visualize-tables                                       Enable debug output which           │
│                                                                                                 visualizes the table cells          │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-tables]          │
│ --version                                                                                       Show version information.           │
│ --document-timeout                                           FLOAT                              The timeout for processing each     │
│                                                                                                 document, in seconds.               │
│                                                                                                 [default: None]                     │
│ --help                                                                                          Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-11 15:06:10 +01:00
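Note on the timeout commit above: in the first test run the warning reports 24.555 seconds against a 10-second limit, which suggests the limit is checked between units of work rather than enforced preemptively. The sketch below illustrates that kind of cooperative check under stated assumptions; `process_with_timeout`, its parameters, and the page-by-page granularity are hypothetical and not taken from the project.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def process_with_timeout(
    pages: Iterable[str],
    handle_page: Callable[[str], str],
    timeout: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Process pages in order and stop once the elapsed time exceeds `timeout`.

    Returns (results, timed_out). A timeout of None disables the check.
    """
    start = clock()
    results: List[str] = []
    for page in pages:
        results.append(handle_page(page))
        # The check runs only after a page has finished, so the total
        # runtime can overshoot the limit by up to one page.
        elapsed = clock() - start
        if timeout is not None and elapsed > timeout:
            return results, True
    return results, False
```

The `clock` parameter exists only to make the behaviour easy to test with a fake time source; a caller would normally leave it at its default.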
Christoph Auer
aee9c0b324
fix: Do not import python modules from deepsearch-glm (#569)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-11 12:29:06 +01:00
Christoph Auer
f45499ce93
fix: Handle no result from RapidOcr reader (#558)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-12-10 16:25:05 +01:00
Panos Vagenas
d0c9e8e508
docs: update chunking usage docs, minor reorg (#550)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 16:03:02 +01:00
Michele Dolfi
a7df337654
fix: make enum serializable with human-readable value (#555)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:12:44 +01:00
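Note on the enum commit above: one common way to make an enum serialize to a human-readable value in Python is to mix in `str`. This is a sketch of that general technique, not necessarily what the commit changed; the class name `TableMode` is hypothetical, with member values borrowed from the `--table-mode` choices shown in the CLI help earlier in this log.

```python
import json
from enum import Enum


class TableMode(str, Enum):
    """Hypothetical enum; the str mixin makes members JSON-serializable by value."""

    FAST = "fast"
    ACCURATE = "accurate"


# The standard json module encodes str subclasses using their string value.
payload = json.dumps({"mode": TableMode.ACCURATE})

# The readable value also round-trips back to the enum member.
restored = TableMode(json.loads(payload)["mode"])
```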
github-actions[bot]
eb30c4f763 chore: bump version to 2.10.0 [skip ci] 2024-12-09 16:28:46 +00:00
Christoph Auer
7972d47f88
fix: Call into docling-core for legacy document transform (#551)
Call into docling-core for legacy document transform

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 17:06:47 +01:00
Nikos Livathinos
78f61a8522
fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
github-actions[bot]
9fd2cf847a chore: bump version to 2.9.0 [skip ci] 2024-12-09 09:33:55 +00:00
Panos Vagenas
c8ecdd987e
feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
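Note on the style-id commit above: a style's display name is localized, while the commit expects style ids to stay stable across locales. The sketch below shows heading detection keyed on the id. It assumes ids of the form `Heading1`, `Heading2`, and so on, which is an assumption for illustration; the function name `heading_level` is hypothetical.

```python
import re
from typing import Optional

# Assumed id pattern for heading styles: "Heading" followed by the level.
_HEADING_ID = re.compile(r"^Heading(\d+)$")


def heading_level(style_id: str) -> Optional[int]:
    """Return the heading level encoded in a style id, or None for non-heading styles."""
    match = _HEADING_ID.match(style_id)
    return int(match.group(1)) if match else None
```

A lookup keyed on the display name would miss a localized name such as "Überschrift 1", whereas the id-based lookup is unaffected as long as the id itself is not localized.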
Michele Dolfi
53039a8367
ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
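Note on the CI commit above: in the Conventional Commits format, a `!` before the colon marks a breaking change. The following is a rough sketch of a title parser that accepts it. The regular expression and the function name `parse_title` are assumptions for illustration and do not reflect the project's actual CI check.

```python
import re
from typing import Dict, Optional, Union

# type, optional (scope), optional "!", then ": " and the subject.
_TITLE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<subject>.+)$"
)


def parse_title(title: str) -> Optional[Dict[str, Union[str, bool, None]]]:
    """Split a commit title into its parts, or return None if it does not match."""
    match = _TITLE.match(title)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "subject": match.group("subject"),
    }
```

A `!` that appears only in the subject, as in the title of the commit above, is not treated as a breaking-change marker.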
Sander Maijers
9102fe1adc
fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
e780333440
docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
0d11e30dd8
fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the referenced seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Michele Dolfi
c830b92b2e
fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 09:33:39 +01:00
Michele Dolfi
8ada0bccc7
fix: folder input in cli (#511)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-04 14:22:00 +01:00
github-actions[bot]
9c788ae778 chore: bump version to 2.8.3 [skip ci] 2024-12-03 15:16:47 +00:00
Christoph Auer
34c7c79858
fix: improve handling of disallowed formats (#429)
* fix: Fixes and tests for StopIteration on .convert()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Remove unnecessary case handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Other test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* improve handling of unsupported types

- Introduced new explicit exception types instead of `RuntimeError`
- Introduced new `ConversionStatus` value for unsupported formats
- Tidied up converter member typing & removed asserts

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* robustify & simplify format option resolution

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* rename new status, populate ConversionResult errors

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-03 12:45:32 +01:00
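Note on the disallowed-formats commit above: the message describes replacing `RuntimeError` with explicit exception types, adding a dedicated status for unsupported formats, and populating errors on the result. The sketch below shows that pattern in isolation. Every name in it (`ConversionError`, `FormatNotAllowedError`, `Status`, `Result`, `convert_all`, `raises_on_error`) is hypothetical and chosen for illustration only.

```python
import os
from enum import Enum
from typing import Iterable, List, NamedTuple, Set


class ConversionError(Exception):
    """Hypothetical base class for conversion failures."""


class FormatNotAllowedError(ConversionError):
    """Hypothetical error for inputs whose format is not in the allowed set."""


class Status(str, Enum):
    SUCCESS = "success"
    SKIPPED = "skipped"  # hypothetical name for the unsupported-format status


class Result(NamedTuple):
    path: str
    status: Status
    errors: List[str]


def convert_all(paths: Iterable[str], allowed: Set[str], raises_on_error: bool = False) -> List[Result]:
    """Report a status per input instead of failing on the first unsupported one."""
    results: List[Result] = []
    for path in paths:
        suffix = os.path.splitext(path)[1].lstrip(".").lower()
        if suffix not in allowed:
            message = f"format not allowed: {path}"
            if raises_on_error:
                raise FormatNotAllowedError(message)
            # Record the problem on the result rather than raising.
            results.append(Result(path, Status.SKIPPED, [message]))
            continue
        results.append(Result(path, Status.SUCCESS, []))
    return results
```

Callers can then catch the specific exception type, or inspect each result's status and errors, without matching on the text of a generic `RuntimeError`.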
github-actions[bot]
2254845da3 chore: bump version to 2.8.2 [skip ci] 2024-12-03 10:47:29 +00:00
Michele Dolfi
672962a8b2
chore: update numpy lock (#500)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:21:31 +01:00
guglie
c90c41c391
fix: ParserError EOF inside string (#470) (#472)
Signed-off-by: guglie <gdguglie@gmail.com>
2024-12-03 11:21:18 +01:00
Michele Dolfi
5ba3807f31
docs: add styling for faq (#502)
* docs: add styling to faq

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torchaudio

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-03 11:20:49 +01:00
Panos Vagenas
051789d017
perf: prevent temp file leftovers, reuse core type (#487)
* chore: reuse DocumentStream from docling-core

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update docling-core version

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* [skip ci] document  import line

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)

use new resolve_source_to_x functions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-12-03 10:40:28 +01:00
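Note on the temp-file commit above: leftovers are typically avoided by tying the lifetime of temporary files to a scope. The sketch below uses the standard library's `tempfile.TemporaryDirectory` to show the general idea; it is not the `resolve_source_to_x` functions mentioned in the commit, and `measure_payload` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def measure_payload(data: bytes, filename: str = "input.bin") -> Tuple[int, Path]:
    """Write data to a scratch file, read its size, and clean up on exit.

    Returns the size and the scratch directory path (which no longer exists).
    """
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / filename
        target.write_bytes(data)
        size = target.stat().st_size
    # Leaving the with-block removes the directory and everything in it,
    # even if an exception was raised inside.
    return size, Path(workdir)
```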
Gaspard Petit
d3f84b2457
fix: PermissionError when using tesseract_ocr_cli_model (#496)
Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>
2024-12-03 10:22:03 +01:00
Álvaro Huertas
33cff98d36
docs: typo in faq (#484)
Typo faq.md

Signed-off-by: Álvaro Huertas <123009293+huertin03@users.noreply.github.com>
2024-12-02 10:35:24 +01:00
Michele Dolfi
d4872103b8
docs: add automatic api reference (#475)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-02 09:55:52 +01:00
Michele Dolfi
8ccb3c6db6
docs: introduce faq section (#468)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 22:34:56 +01:00
github-actions[bot]
cc46c938b6 chore: bump version to 2.8.1 [skip ci] 2024-11-29 13:04:48 +00:00
Michele Dolfi
dd8de46267
fix(cli): expose debug options (#467)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 13:25:58 +01:00
Michele Dolfi
af63818df5
fix: remove unused deps (#466)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-29 13:18:06 +01:00
Panos Vagenas
84c46fdeb3
docs: extend integration docs & README (#456)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-28 09:41:21 +01:00
github-actions[bot]
211f4f7570 chore: bump version to 2.8.0 [skip ci] 2024-11-27 13:29:32 +00:00
Swaymaw
85b29990be
feat(ocr): added support for RapidOCR engine (#415)
* adding rapidocr engine for ocr in docling

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>

* fixing styling format

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* updating pyproject.toml and poetry.lock to fix ci bugs

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* help poetry pinning for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* simplifying rapidocr options so that device can be changed using a single option for all models

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* fix styling issues and small bug in rapidOcrOptions

Signed-off-by: Swaymaw <swaymaw@gmail.com>

* use default device until we enable global management

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: swayam-singhal <swayam.singhal@inito.com>
Signed-off-by: Swaymaw <swaymaw@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: swayam-singhal <swayam.singhal@inito.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-27 13:57:41 +01:00
Manuel030
767563bf8b
fix: use correct image index in word backend (#442)
* fix image index in word backend

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* fix: Fixes for wordx (#432)

* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* sign dco

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

* correct rebase error

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>

---------

Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-27 13:45:07 +01:00
Christoph Auer
29807a2d68
fix: Update tests and examples for docling-core 2.5.1 (#449)
* Update tests for docling-core 2.5.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add export with referenced images to export_figures example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Revert "Fix OCR tests"

This reverts commit 12b575946f51950fcacece99d4d6eb682125d779.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile for docling-core 2.5.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-11-27 13:07:00 +01:00
github-actions[bot]
6666d9ec07 chore: bump version to 2.7.1 [skip ci] 2024-11-26 15:01:33 +00:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx (#432)
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Michele Dolfi
d7072b4b56
fix: force pydantic < 2.10.0 (#407)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-22 08:23:11 +01:00
Peter W. J. Staar
2a1d3fd221
chore: update the README (#409)
* chore: update the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update README.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* chore: update the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:28:53 +01:00
Panos Vagenas
7a45b92078
docs: add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:23:04 +01:00
Michele Dolfi
97d571af97
chore: add downloads in README, security policy and update ci actions (#401)
* add security policy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deprecated actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add comment about licenses for new dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pypi downloads badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add citation file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-21 13:59:45 +01:00
github-actions[bot]
eb64f6d368 chore: bump version to 2.7.0 [skip ci] 2024-11-20 15:36:51 +00:00
Michele Dolfi
7b013abcf3
fix: python3.9 support (#396)
* fixes for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse with python3.9 wheels

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 15:21:40 +01:00