Docling/tests/data
Maxim Lysak 2f72167ff6
feat: updated vlm pipeline (with latest changes from docling-core) (#1158)
* Draft implementation of Doctag backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated VLM pipeline doctags to docling conversion, now properly supports lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* preparing to migrate to new doctags deserializer

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* re-using DocTagsDocument.from_doctags_and_image_pairs

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* satisfying mypy and other checks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added support for force_backend_text parameter

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed unnecessary transformation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated readme

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-18 15:44:51 +01:00
..
asciidoc fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
csv feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
docx feat: equations to latex in MSWord backend (with inline groups) (#1114) 2025-03-13 15:12:22 +01:00
groundtruth feat: updated vlm pipeline (with latest changes from docling-core) (#1158) 2025-03-18 15:44:51 +01:00
html fix(html): handle nested empty lists (#1154) 2025-03-13 16:56:58 +01:00
jats feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
md fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
pdf fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
pptx feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
uspto feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
xlsx fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
2305.03393v1-pg9-img.png feat!: Docling v2 (#117) 2024-10-16 21:02:03 +02:00