Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag ( #818 )
...
* fix: parse HTML files without body tag
Parse HTML files without 'body' tag, since it is optional in HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-27 16:59:00 +01:00
Cesar Berrospi Ramis
c2ae1cc4ca
docs: description of supported formats and backends ( #788 )
...
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-26 08:10:33 +01:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel ( #804 )
...
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
* fixed some issues
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pinned pillow in pyproject
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-24 18:48:02 +01:00
Panos Vagenas
88a0e66adc
feat: add Docling JSON ingestion ( #783 )
...
* feat: add Docling JSON ingestion
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* Update docling/backend/json/docling_json_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-24 18:05:23 +01:00
Matteo
3213b247ad
feat: Code and equation model for PDF and code blocks in markdown ( #752 )
...
* propagated changes for new CodeItem class
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* Rebased branch on latest main. changes for CodeItem
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed unused files
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* chore: update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* pin latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update docling-core pinning
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* pin docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use new add_code in backends and update typing in MD backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* added if statement for backend
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed unused import
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* removed print statements
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* gt for new pdf
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* Update docling/pipeline/standard_pdf_pipeline.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
* fixed doc comment of __call__ function of code_formula_model
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
* fix artifacts_path type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move expansion_factor to base class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2025-01-24 16:54:22 +01:00
Michele Dolfi
57fc28d3d8
refactor: allow the usage of backends in the enrich models and generalize the interface ( #742 )
...
* fix get image with cropbox
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* allow the usage of backends in the enrich models and generalize the interface
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move logic in BaseTextImageEnrichmentModel
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* renaming
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-15 09:52:38 +01:00
Christoph Auer
5a060f237d
fix: Improve OCR results, stricten criteria before dropping bitmap areas ( #719 )
...
fix: Properly care for all bitmap elements in OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-01-10 10:38:49 +01:00
Christoph Auer
42856fdf79
fix: Let BeautifulSoup detect the HTML encoding ( #695 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-01-07 15:49:28 +01:00
Jinfeng Sun
d49650c54f
fix(mspowerpoint): handle invalid images in PowerPoint slides ( #650 )
...
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load
Signed-off-by: Tendo33 <sjf1998112@gmail.com >
2025-01-07 13:58:10 +01:00
Lucas Morin
fd034802b6
feat: Create a backend to transform PubMed XML files to DoclingDocument ( #557 )
...
Signed-off-by: lucas-morin <lucas.morin222@gmail.com >
2024-12-17 19:27:09 +01:00
Cesar Berrospi Ramis
4e087504cc
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
...
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: change the name of the USPTO input format
Change the name of the patent USPTO input format to show the typical format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: address several input formats with same mime type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2024-12-17 16:35:23 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend ( #549 )
...
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-12-09 13:26:17 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend ( #537 )
...
Correcting DefaultText ID for MS Word backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic ( #534 )
...
Using style id instead of style names, which should be localization agnostic
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 15:17:56 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table ( #528 )
...
Fix for missing text in docx (t tag) when embedded in a table
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-12-06 12:37:25 +01:00
Manuel030
767563bf8b
fix: use correct image index in word backend ( #442 )
...
* fix image index in word backend
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* fix: Fixes for wordx (#432 )
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* sign dco
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
* correct rebase error
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
---------
Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-27 13:45:07 +01:00
Maxim Lysak
d0a1180478
fix: Fixes for wordx ( #432 )
...
* fixes for referencing drawing blip in wordx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated lxml dependency version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-26 14:44:43 +01:00
Peter W. J. Staar
926dfd29d5
feat: added excel backend ( #334 )
...
* feat: added excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tooling for the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added tests for merged cells in excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2024-11-19 12:21:17 +01:00
Maxim Lysak
7a97d7119f
feat: Extracting picture data for raster images found in PPTX ( #349 )
...
* Added picture data for pptx pictures
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added tests for pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Inferring image DPI from pptx file
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-18 15:22:28 +01:00
Maxim Lysak
8533039b0c
fix: Fixing images in the input Word files ( #330 )
...
* Fixing images identification in the input Word files
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Populating extracted image data into docling picture for wordx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed base64 dependency in msword_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-14 13:33:34 +01:00
Maxim Lysak
fb8ba861e2
fix: Handling of single-cell tables in DOCX backend ( #314 )
...
* Handling of single-cell tables in DOCX backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* returned try-catch on tables handling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* proceed processing the content of single cell table as if its just part of the body
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added example of trickly 1 cell table docx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-12 15:20:55 +01:00
Maxim Lysak
81c8243a8b
fix: Added handling of grouped elements in pptx backend ( #307 )
...
* Added handling of grouped elements in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* updated log.warn to warning
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 16:38:21 +01:00
Maxim Lysak
53bf2d1790
Added handling of code blocks in html with <pre> tag ( #302 )
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-11-11 15:00:11 +01:00
Ikko Eltociear Ashimine
c3098e3c12
chore: fix typo ( #241 )
...
* chore: update pypdfium2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
* chore: update docling_parse_v2_backend.py
occured -> occurred
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
---------
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com >
2024-11-05 16:20:04 +01:00
Christoph Auer
2a2c65bf4f
feat: Add pipeline timings and toggle visualization, establish debug settings ( #183 )
...
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Refactor and fix profiling codes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Visualization codes output PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-10-30 15:04:19 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML ( #184 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-28 17:02:52 +01:00
Maxim Lysak
94d0729c50
fix: handling of long sequence of unescaped underscore chars in markdown ( #173 )
...
* Fix for md hanging when encountering long sequence of unescaped underscore chars
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added comment explaining reason for fix
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed trailing inline text handling (at the end of a file), and corrected underscore sequence shortening
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* making fix more rare
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-28 16:34:48 +01:00
Maxim Lysak
7d19418b77
fix: HTML backend, fixes for Lists and nested texts ( #180 )
...
* Fixes for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 20:14:04 +02:00
Maxim Lysak
88c1673057
fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers ( #178 )
...
* Small fix to properly handle trailing inline text in the md backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper handling of headers with bold, italic or emphasis
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* removed print
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Made smarter processing of headers, with arbitrary styling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated docling-core to 2.2.1
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests because of the change in Markdown export in docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-25 18:02:20 +02:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history ( #170 )
...
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update imports for docling_parse.pdf_parser_v1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* repin poetry.lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-23 17:20:11 +02:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format ( #168 )
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Ensure all models work only on valid pages (#158 )
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* ci: run ci also on forks (#160 )
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
* fix: fix legacy doc ref (#162 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* docs: typo fix (#155 )
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
* feat: add coverage_threshold to skip OCR for small images (#161 )
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* able to parse the captions and image uri's
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Fixed issue with group ordeering in pptx backend, added gebug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Peter Staar <taa@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2024-10-23 16:14:26 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 ( #117 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-16 21:02:03 +02:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend ( #131 )
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-10-11 15:12:49 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) ( #72 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-09-12 15:56:29 +02:00
Christoph Auer
a294b7e64a
feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status ( #47 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 16:18:41 +02:00
Christoph Auer
8808463cec
fix: Better raise exception when a page fails to parse ( #46 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Raise from page backend if page is not correctly parsed
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 13:51:42 +02:00
Christoph Auer
7e84533299
fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages ( #45 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-23 12:51:02 +02:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing ( #44 )
...
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* repin after more packages on pypi
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-22 13:49:37 +02:00
Michele Dolfi
fac5745dc8
fix: usage of bytesio with docling-parse ( #43 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-22 12:59:49 +02:00
Christoph Auer
e94d317c02
feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering ( #38 )
...
* Introduce adaptive OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Factor out BaseOcrModel, add docling-parse backend tests, fixes
* Make easyocr default dep
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-20 15:28:03 +02:00
Michele Dolfi
78347bf679
feat: allow computing page images on-demand with scale and cache them ( #36 )
...
* feat: allow computing page images on-demand and cache them
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* feat: expose scale for export of page images and document elements
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix comment
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-20 13:27:19 +02:00
Christoph Auer
c253dd743a
Add redbooks to test data, small additions ( #35 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2024-08-20 12:36:00 +02:00
Michele Dolfi
90dd676422
feat: update parser with bytesio interface and set as new default backend ( #32 )
...
* update parser with bytesio interface
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* change default backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update DEFAULT_BACKEND
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-14 12:30:00 +02:00
Michele Dolfi
794b20a50a
fix: type of path_or_stream in PdfDocumentBackend ( #28 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2024-08-07 17:20:44 +02:00
Maxim Lysak
b8f5e38a8c
feat: introducing docling_backend ( #26 )
...
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com >
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com >
2024-08-07 16:22:36 +02:00
mara004
3eca8b8485
refactor(pypdfium2): just forward input to PdfDocument directly ( #17 )
...
PdfDocument() should do accept strings, paths, bytes and byte streams. If not, please file a bug report.
Signed-off-by: mara004 <geisserml@gmail.com >
2024-07-25 08:54:57 +02:00
Christoph Auer
e2d996753b
Initial commit
2024-07-15 09:42:42 +02:00