Cesar Berrospi Ramis
b886e4df31
fix(asciidoc): set default size when missing in image directive ( #1769 )
...
The AsciiDoc backend should not create an ImageRef with Size equal to None, instead use default size values.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-16 10:38:46 +02:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix regular expression to identify header lines in AsciiDoc avoiding to
match defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-04-14 18:01:26 +02:00
Panos Vagenas
0945973b79
fix: use first table row as col headers ( #1156 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-13 15:34:18 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown ( #752 )
...
* propagated changes for new CodeItem class
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* Rebased branch on latest main. changes for CodeItem
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed unused files
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* chore: update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* pin latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docling-core pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use new add_code in backends and update typing in MD backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* added if statement for backend
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed unused import
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* removed print statements
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* gt for new pdf
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* Update docling/pipeline/standard_pdf_pipeline.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
* fixed doc comment of __call__ function of code_formula_model
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
* fix artifacts_path type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move expansion_factor to base class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-24 16:54:22 +01:00
Christoph Auer
2a2c65bf4f
feat: Add pipeline timings and toggle visualization, establish debug settings ( #183 )
...
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Refactor and fix profiling codes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Visualization codes output PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-30 15:04:19 +01:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format ( #168 )
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Ensure all models work only on valid pages (#158 )
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* ci: run ci also on forks (#160 )
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* fix: fix legacy doc ref (#162 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* docs: typo fix (#155 )
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add coverage_threshold to skip OCR for small images (#161 )
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* able to parse the captions and image uri's
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issue with group ordeering in pptx backend, added gebug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00