Qiefan Jiang
13865c06f5
perf(msexcel): _find_table_bounds use iter_rows/iter_cols instead of Worksheet.cell ( #1875 )
...
* perf(msexcel): _find_table_bounds use iter_rows/iter_cols instead of sheet.cell
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: 274102a8d4db5d2da8c7ca603e1eb039c1e07967
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* fix lint
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: b6b5b090a99ba7ba23c1facf0317f7e9f95039e5
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
---------
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
2025-07-03 13:12:06 +02:00
Christoph Auer
bdfee4e2d0
chore: Safer unloading of DPv4 backend ( #1867 )
...
fix: Safer unloading of DPv4 backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-30 14:41:21 +02:00
Panos Vagenas
0533da1923
feat: leverage new list modeling, capture default markers ( #1856 )
...
* chore: update docling-core & regenerate test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update backends to leverage new list modeling
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* ensure availability of latest docling-core API
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-27 16:37:15 +02:00
Michael Honaker
e79e4f0ab6
fix(markdown): make parsing of rich table cells valid ( #1821 )
...
* fix: update md table classification
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix ground truth header changes
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix merge issues
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix minor ground truth errors
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
---------
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
2025-06-26 19:50:45 +02:00
Panos Vagenas
7c5614a37a
fix(markdown): fix single-formatted headings & list items ( #1820 )
...
* fix(markdown): fix formatting & inline edge cases (show behavior before change)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* add change and updated test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* improve test case
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-25 13:05:06 +02:00
Allen N.
4002de1f92
fix: Handle missing runs to avoid out of range exception ( #1844 )
...
Fixes #1681 on upstream
Signed-off-by: Allen Nikka <allennikka@gmail.com >
2025-06-25 07:55:27 +02:00
Peter W. J. Staar
1557e7ce3e
feat: Support audio input ( #1763 )
...
* scaffolding in place
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* doing scaffolding for audio pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* WIP: got first transcription working
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, time to start cleaning up
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working ASR pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added openai-whisper as a first transcription model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updating with asr_options
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalised the first working ASR pipeline with Whisper
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use whisper from the latest git commit
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* updated comment
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* AudioBackend -> DummyBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* file rename
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename to NoOpBackend, add test for ASR pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Support every format in NoOpBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add missing audio file and test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Install ffmpeg system dependency for ASR test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-23 14:47:26 +02:00
Cesar Berrospi Ramis
d26dac61a8
fix(docx): ensure list items have a list parent ( #1827 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-20 14:47:25 +02:00
mkrssg
1350a8d3e5
fix(msword_backend): Identify text in the same line after an image #1425 ( #1610 )
...
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com >
2025-06-20 10:55:30 +02:00
Panos Vagenas
861abcdcb0
feat(markdown): add formatting & improve inline support ( #1804 )
...
feat(markdown): support formatting & hyperlinks
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-18 15:57:57 +02:00
Martin Wind
f28d23cf03
fix: pptx line break and space handling ( #1664 )
...
Signed-off-by: Martin Wind <martin.wind@im-c.at >
2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis
b886e4df31
fix(asciidoc): set default size when missing in image directive ( #1769 )
...
The AsciiDoc backend should not create an ImageRef with Size equal to None; it should use default size values instead.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-16 10:38:46 +02:00
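A minimal sketch of the default-size fallback described in the commit above. It is not the backend's actual code: the Size dataclass, the default values, and the image_size helper are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Size:
    """Stand-in for an image size type (hypothetical, not docling's class)."""
    width: float
    height: float


# Assumed fallback values; the real defaults are not stated in the commit.
DEFAULT_WIDTH = 640.0
DEFAULT_HEIGHT = 480.0


def image_size(directive: str) -> Size:
    """Read width/height from an AsciiDoc image directive, defaulting if absent."""
    # Collect any 'width=N' / 'height=N' attributes found in the directive.
    attrs = dict(re.findall(r"(width|height)\s*=\s*(\d+)", directive))
    return Size(
        width=float(attrs.get("width", DEFAULT_WIDTH)),
        height=float(attrs.get("height", DEFAULT_HEIGHT)),
    )
```

With this shape, a directive without size attributes still yields a usable Size rather than None.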
Christoph Auer
7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it ( #1745 )
...
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Make page.parsed_page the only source of truth for text cells
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Small fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Correctly compute PDF boxes from pymupdf
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use different OCR engine order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add type hints and fix mypy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* One more test fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Remove with pypdfium2_lock from caller sites
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix typing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-13 19:01:55 +02:00
Bruno Rigal
7a275c7637
fix: Handle NoneType error in MsPowerpointDocumentBackend ( #1747 )
...
fix: NoneType error in pptx backend
Signed-off-by: Bruno Rigal <bruno.rigal@probayes.com >
Co-authored-by: Bruno Rigal <bruno.rigal@probayes.com >
2025-06-10 19:43:20 +02:00
AndrewTsai0406
9dbcb3d7d4
fix: Improve extraction from textboxes in Word docs ( #1701 )
...
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
---------
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
2025-06-06 11:37:46 +02:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-22 13:43:33 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from the bounding box instead of merging it, to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix the regular expression that identifies header lines in AsciiDoc to avoid
matching defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
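A minimal sketch of the kind of pattern the commit above describes, assuming the problem was that block delimiter lines such as "====" were being read as headers. The pattern and the parse_header helper are illustrative assumptions, not the regular expression used in the backend.

```python
import re

# Hypothetical pattern: one to six '=' markers, at least one space, then a title.
# A bare delimiter line such as '====' has no title and so does not match.
HEADER_RE = re.compile(r"^(={1,6})\s+(\S.*)$")


def parse_header(line: str):
    """Return (marker_count, title) for a header line, or None otherwise."""
    match = HEADER_RE.match(line.rstrip())
    if match is None:
        return None
    return len(match.group(1)), match.group(2)
```

Requiring whitespace and a non-empty title after the markers is what separates "== Section" from a delimiter line made only of "=" characters.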
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
nkh0472
a097ccd8d5
chore: typo fix ( #1465 )
...
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-28 08:52:09 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* style(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inherited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inherited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Make heading detection more flexible
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inherited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached ( #1316 )
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing ( #1268 )
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Make heading detection more flexible
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-04-02 12:56:44 +02:00
Panos Vagenas
9210812bfa
fix: improve HTML layer detection, various MD fixes ( #1241 )
...
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b
fix(html): fix HTML parsed heading level ( #1244 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 10:30:23 +01:00
Rafael Teixeira de Lima
f739d0e4c5
fix(docx): identifying numbered headers ( #1231 )
...
* Modifications to identify numbered headers
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add style check
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-25 11:41:02 +01:00
Maciej Wieczorek
b454aa1551
feat: Add PPTX notes slides ( #474 )
...
* feat: Add PPTX notes slides
Presenter notes may have useful information and should also be extracted.
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
* feat: Move presenter notes into furniture
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
---------
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724
fix: Determine correct page size in DoclingParseV4Backend ( #1196 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 11:05:42 +01:00
Rafael Teixeira de Lima
0b707d0882
fix(msword): Fixing function return in equations handling ( #1194 )
...
* Fixing function return
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add message
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-19 10:34:25 +01:00
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API ( #905 )
...
* Add DoclingParseV3 backend implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use docling-core with docling-parse types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes and test updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test units
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add back DoclingParse v1 backend, pipeline options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update locks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Ground-truth files updated
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update tests, use TextCell.from_ocr property
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Text fixes, new test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename docling backend to v4
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Test all backends, fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset all tests to use docling-parse v1 for now
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for DPv4 backend init, better test coverage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* test_input_doc use default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-18 10:38:19 +01:00
Cesar Berrospi Ramis
f94da44ec5
fix(html): handle nested empty lists ( #1154 )
...
Address the case of nested lists in empty list items.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 16:56:58 +01:00
Panos Vagenas
0945973b79
fix: use first table row as col headers ( #1156 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima
6eb718f849
feat: equations to latex in MSWord backend (with inline groups) ( #1114 )
...
* Equation groups
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Proper handling of orphan IDs in layout postprocessing (#1118 )
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.25.2 [skip ci]
* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124 )
add env var in docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(CLI): fix help message for abort options (#1130 )
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* perf: New revision code formula model and document picture classifier (#1140 )
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: Use new TableFormer model weights and default to accurate model version (#1100 )
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.26.0 [skip ci]
* fix: Pass tests, update docling-core to 2.22.0 (#1150 )
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Updating content hash
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
aa92a57fa9
fix: Pass tests, update docling-core to 2.22.0 ( #1150 )
...
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 09:45:55 +01:00
Michele Dolfi
8dc0562542
fix: enable locks for threadsafe pdfium ( #1052 )
...
* enable locks for threadsafe pdfium
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix deadlock in pypdfium2 backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-03-02 20:06:44 +01:00
Peter W. J. Staar
e25d557c06
refactor: add the contentlayer to html-backend ( #1040 )
...
* added the contentlayer to html-backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the handle_image function
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code of html backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* test(html): add more info if a test case fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor(html): put parsed item in body if doc has no header
If an HTML document does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with the paragraph label.
Update test ground truth according to the changes above.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: set TextItem label to 'text' instead of 'paragraph'
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis
de7b963b09
fix(html): use 'start' attribute when parsing ordered lists from HTML docs ( #1062 )
...
* fix(html): use 'start' attribute in ordered lists
When parsing ordered lists in HTML, take into account the 'start' attribute if it exists.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(html): reduce verbosity in HTML backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-27 09:46:57 +01:00
Cesar Berrospi Ramis
1b0ead6907
fix(html): Parse text in div elements as TextItem ( #1041 )
...
feat(html): Parse text in div elements as TextItem
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-24 12:38:29 +01:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON ( #1027 )
...
* test: avoid testing exact JSON
Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Update tests/test_backend_patent_uspto.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2025-02-20 16:20:07 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints ( #999 )
...
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-18 11:30:47 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents ( #967 )
...
* chore(xml-jats): separate authors and affiliations
In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(xml-jats): replace new line character by a space
Instead of removing newline characters from text, replace each one with a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, and improve text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-17 10:43:31 +01:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
...
* feat: Implement csv backend and format detection
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* test: Implement csv parsing and format tests
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* docs: Add example and CSV format documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* feat: Add support for various CSV dialects and update documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* feat: Add validation for delimiters and tests for inconsistent csv files
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
---------
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
2025-02-14 08:55:09 +01:00
Panos Vagenas
90b766e2ae
fix(markdown): handle nested lists ( #910 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-02-07 12:55:12 +01:00