Cesar Berrospi Ramis
0cd81a8122
fix(docx): merged table cells not properly converted ( #857 )
...
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-03 10:20:03 +01:00
Maxim Lysak
eff16b62cc
fix: Processing of placeholder shapes in pptx that have text but no bbox ( #868 )
...
Processing of placeholder shapes in pptx that have text but no bbox
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-02-03 09:33:33 +01:00
Maxim Lysak
d727b04ad0
feat(docx): Support of SDTs in docx backend ( #853 )
...
Support of table of content containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-01-31 14:52:24 +01:00
Maxim Lysak
2c037ae62e
fix: Fixed docx import with headers that are also lists ( #842 )
...
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-31 10:51:21 +01:00
Michele Dolfi
2a1f8afe7e
fix: use new add_code in html backend and add more typing hints ( #850 )
...
fix add_code in html backend and add more typing hints
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-31 09:54:17 +01:00
Panos Vagenas
bccb022fc8
fix(markdown): fix empty block handling ( #843 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-01-30 16:22:29 +01:00
Maxim Lysak
fea0a99a95
fix: Fix for the crash when encountering WMF images in pptx and docx ( #837 )
...
* Fix for the crash when encountering WMF images in pptx and docx backends on non-Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-01-30 14:58:27 +01:00
Panos Vagenas
5aed9f8aeb
fix: fix single newline handling in MD backend ( #824 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag ( #818 )
...
* fix: parse HTML files without body tag
Parse HTML files without a 'body' tag, since it is optional in the HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-27 16:59:00 +01:00
Cesar Berrospi Ramis
c2ae1cc4ca
docs: description of supported formats and backends ( #788 )
...
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-26 08:10:33 +01:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel ( #804 )
...
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
* fixed some issues
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pinned pillow in pyproject
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-24 18:48:02 +01:00
Panos Vagenas
88a0e66adc
feat: add Docling JSON ingestion ( #783 )
...
* feat: add Docling JSON ingestion
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* Update docling/backend/json/docling_json_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-24 18:05:23 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown ( #752 )
...
* propagated changes for new CodeItem class
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* Rebased branch on latest main. changes for CodeItem
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed unused files
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* chore: update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* pin latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update docling-core pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* pin docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use new add_code in backends and update typing in MD backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* added if statement for backend
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed unused import
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed print statements
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* gt for new pdf
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* Update docling/pipeline/standard_pdf_pipeline.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
* fixed doc comment of __call__ function of code_formula_model
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* fix artifacts_path type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move expansion_factor to base class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2025-01-24 16:54:22 +01:00
Michele Dolfi
57fc28d3d8
refactor: allow the usage of backends in the enrich models and generalize the interface ( #742 )
...
* fix get image with cropbox
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* allow the usage of backends in the enrich models and generalize the interface
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move logic in BaseTextImageEnrichmentModel
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-15 09:52:38 +01:00
Christoph Auer
5a060f237d
fix: Improve OCR results, tighten criteria before dropping bitmap areas ( #719 )
...
fix: Properly care for all bitmap elements in OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-01-10 10:38:49 +01:00
Christoph Auer
42856fdf79
fix: Let BeautifulSoup detect the HTML encoding ( #695 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-01-07 15:49:28 +01:00
Jinfeng Sun
d49650c54f
fix(mspowerpoint): handle invalid images in PowerPoint slides ( #650 )
...
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load
Signed-off-by: Tendo33 <sjf1998112@gmail.com >
2025-01-07 13:58:10 +01:00
Lucas Morin
fd034802b6
feat: Create a backend to transform PubMed XML files to DoclingDocument ( #557 )
...
Signed-off-by: lucas-morin <lucas.morin222@gmail.com >
2024-12-17 19:27:09 +01:00
Cesar Berrospi Ramis
4e087504cc
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
...
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: change the name of the USPTO input format
Change the name of the patent USPTO input format to show the typical format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: address several input formats with same mime type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2024-12-17 16:35:23 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend ( #549 )
...
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-12-09 13:26:17 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend ( #537 )
...
Correcting DefaultText ID for MS Word backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic ( #534 )
...
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 15:17:56 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table ( #528 )
...
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 12:37:25 +01:00
Manuel030
767563bf8b
fix: use correct image index in word backend ( #442 )
...
* fix image index in word backend
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* fix: Fixes for wordx (#432 )
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* sign dco
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* correct rebase error
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
---------
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-27 13:45:07 +01:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx ( #432 )
...
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-26 14:44:43 +01:00
Peter W. J. Staar
926dfd29d5
feat: added excel backend ( #334 )
...
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2024-11-19 12:21:17 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX ( #349 )
...
* Added picture data for pptx pictures
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added tests for pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Inferring image DPI from pptx file
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-18 15:22:28 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-14 13:33:34 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* proceed processing the content of single cell table as if it's just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added example of tricky 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-12 15:20:55 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 15:00:11 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
2024-11-05 16:20:04 +01:00
Christoph Auer
2a2c65bf4f
feat: Add pipeline timings and toggle visualization, establish debug settings ( #183 )
...
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Refactor and fix profiling codes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Visualization codes output PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-30 15:04:19 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML ( #184 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-28 17:02:52 +01:00
Maxim Lysak
94d0729c50
fix: handling of long sequence of unescaped underscore chars in markdown ( #173 )
...
* Fix for md hanging when encountering long sequence of unescaped underscore chars
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added comment explaining reason for fix
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed trailing inline text handling (at the end of a file), and corrected underscore sequence shortening
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* making fix more rare
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-28 16:34:48 +01:00
Maxim Lysak
7d19418b77
fix: HTML backend, fixes for Lists and nested texts ( #180 )
...
* Fixes for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 20:14:04 +02:00
Maxim Lysak
88c1673057
fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers ( #178 )
...
* Small fix to properly handle trailing inline text in the md backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper handling of headers with bold, italic or emphasis
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed print
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Made smarter processing of headers, with arbitrary styling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated docling-core to 2.2.1
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests because of the change in Markdown export in docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 18:02:20 +02:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history ( #170 )
...
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update imports for docling_parse.pdf_parser_v1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* repin poetry.lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-23 17:20:11 +02:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format ( #168 )
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Ensure all models work only on valid pages (#158 )
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* ci: run ci also on forks (#160 )
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
* fix: fix legacy doc ref (#162 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* docs: typo fix (#155 )
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
* feat: add coverage_threshold to skip OCR for small images (#161 )
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* able to parse the captions and image URIs
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issue with group ordering in pptx backend, added debug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-23 16:14:26 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 ( #117 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-16 21:02:03 +02:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend ( #131 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 15:12:49 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) ( #72 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-12 15:56:29 +02:00
Christoph Auer
a294b7e64a
feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status ( #47 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 16:18:41 +02:00
Christoph Auer
8808463cec
fix: Better raise exception when a page fails to parse ( #46 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Raise from page backend if page is not correctly parsed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 13:51:42 +02:00
Christoph Auer
7e84533299
fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages ( #45 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 12:51:02 +02:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing ( #44 )
...
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* repin after more packages on pypi
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-22 13:49:37 +02:00
Michele Dolfi
fac5745dc8
fix: usage of bytesio with docling-parse ( #43 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-22 12:59:49 +02:00
Christoph Auer
e94d317c02
feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering ( #38 )
...
* Introduce adaptive OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Factor out BaseOcrModel, add docling-parse backend tests, fixes
* Make easyocr default dep
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-20 15:28:03 +02:00
Michele Dolfi
78347bf679
feat: allow computing page images on-demand with scale and cache them ( #36 )
...
* feat: allow computing page images on-demand and cache them
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* feat: expose scale for export of page images and document elements
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix comment
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-20 13:27:19 +02:00