Maxim Lysak
d0a1180478
fix: Fixes for wordx ( #432 )
...
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-26 14:44:43 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-14 13:33:34 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* proceed processing the content of single cell table as if its just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added example of trickly 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-11-12 15:20:55 +01:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML ( #184 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-28 17:02:52 +01:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format ( #168 )
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Ensure all models work only on valid pages (#158 )
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* ci: run ci also on forks (#160 )
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* fix: fix legacy doc ref (#162 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* docs: typo fix (#155 )
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add coverage_threshold to skip OCR for small images (#161 )
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* able to parse the captions and image uri's
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issue with group ordeering in pptx backend, added gebug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 ( #117 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00