Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-22 13:43:33 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix the regular expression that identifies header lines in AsciiDoc so that it
no longer matches defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
nkh0472
a097ccd8d5
chore: typo fix ( #1465 )
...
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-28 08:52:09 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* style(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inherited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Make heading detection more flexible
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inherited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00
Maxim Lysak
dc3bf9ceac
fix(pptx): check if picture shape has an image attached ( #1316 )
...
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-07 17:36:56 +02:00
Simon Jégou
bfcab3d677
feat(docx): add text formatting and hyperlink support ( #630 )
...
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima
d2d68747f9
fix(docx): Improve text parsing ( #1268 )
...
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Make heading detection more flexible
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
2025-04-02 12:56:44 +02:00
Panos Vagenas
9210812bfa
fix: improve HTML layer detection, various MD fixes ( #1241 )
...
Markdown fixes:
- properly propagate section header levels
- improve handling of list subroots without text
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 16:07:14 +01:00
Panos Vagenas
85c4df887b
fix(html): fix HTML parsed heading level ( #1244 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-26 10:30:23 +01:00
Rafael Teixeira de Lima
f739d0e4c5
fix(docx): identifying numbered headers ( #1231 )
...
* Modifications to identify numbered headers
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add style check
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-25 11:41:02 +01:00
Maciej Wieczorek
b454aa1551
feat: Add PPTX notes slides ( #474 )
...
* feat: Add PPTX notes slides
Presenter notes may have useful information and should also be extracted.
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
* feat: Move presenter notes into furniture
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
---------
Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co >
2025-03-19 14:52:09 +01:00
Christoph Auer
f5adfb9724
fix: Determine correct page size in DoclingParseV4Backend ( #1196 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-03-19 11:05:42 +01:00
Rafael Teixeira de Lima
0b707d0882
fix(msword): Fixing function return in equations handling ( #1194 )
...
* Fixing function return
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Add message
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
2025-03-19 10:34:25 +01:00
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API ( #905 )
...
* Add DoclingParseV3 backend implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use docling-core with docling-parse types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes and test updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* update test units
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add back DoclingParse v1 backend, pipeline options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update locks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Ground-truth files updated
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update tests, use TextCell.from_ocr property
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Text fixes, new test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename docling backend to v4
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Test all backends, fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Reset all tests to use docling-parse v1 for now
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fixes for DPv4 backend init, better test coverage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* test_input_doc use default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-18 10:38:19 +01:00
Cesar Berrospi Ramis
f94da44ec5
fix(html): handle nested empty lists ( #1154 )
...
Address the case of nested lists in empty list items.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 16:56:58 +01:00
Panos Vagenas
0945973b79
fix: use first table row as col headers ( #1156 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima
6eb718f849
feat: equations to latex in MSWord backend (with inline groups) ( #1114 )
...
* Equation groups
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Proper handling of orphan IDs in layout postprocessing (#1118 )
* Fix the handling of orphan IDs in layout postprocessing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.25.2 [skip ci]
* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124 )
add env var in docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(CLI): fix help message for abort options (#1130 )
fix help message
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* perf: New revision code formula model and document picture classifier (#1140 )
* new version code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new version document picture classifier
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* new code formula model
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
* restored original code formula test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: Use new TableFormer model weights and default to accurate model version (#1100 )
* feat: New tableformer model weights [WIP]
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Updated TF version
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated tests, after merging with Main, Switched to Accurate TF model by default
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: bump version to 2.26.0 [skip ci]
* fix: Pass tests, update docling-core to 2.22.0 (#1150 )
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Updating content hash
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com >
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
aa92a57fa9
fix: Pass tests, update docling-core to 2.22.0 ( #1150 )
...
fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-13 09:45:55 +01:00
Michele Dolfi
8dc0562542
fix: enable locks for threadsafe pdfium ( #1052 )
...
* enable locks for threadsafe pdfium
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* fix deadlock in pypdfium2 backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-03-02 20:06:44 +01:00
Peter W. J. Staar
e25d557c06
refactor: add the contentlayer to html-backend ( #1040 )
...
* added the contentlayer to html-backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the handle_image function
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code of html backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* test(html): add more info if a test case fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor(html): put parsed item in body if doc has no header
In case an HTML does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth according to the changes above.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: set TextItem label to 'text' instead of 'paragraph'
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis
de7b963b09
fix(html): use 'start' attribute when parsing ordered lists from HTML docs ( #1062 )
...
* fix(html): use 'start' attribute in ordered lists
When parsing ordered lists in HTML, take into account the 'start' attribute if it exists.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(html): reduce verbosity in HTML backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-27 09:46:57 +01:00
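The 'start' handling described in #1062 above can be illustrated with a minimal sketch. It is not the backend's code: it uses the standard-library `html.parser` rather than the backend's own parsing, the names `OrderedListNumberer` and `number_items` are hypothetical, and it only covers flat (non-nested) ordered lists.

```python
from html.parser import HTMLParser


class OrderedListNumberer(HTMLParser):
    """Hypothetical helper: numbers <li> items, honouring <ol start="N">."""

    def __init__(self):
        super().__init__()
        self.items = []        # collected [number, text] pairs
        self._counter = None   # next item number; None while outside an <ol>
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            # Fall back to 1 when 'start' is missing or not an integer.
            try:
                self._counter = int(dict(attrs).get("start"))
            except (TypeError, ValueError):
                self._counter = 1
        elif tag == "li" and self._counter is not None:
            self._in_li = True
            self.items.append([self._counter, ""])
            self._counter += 1

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False
        elif tag == "ol":
            self._counter = None

    def handle_data(self, data):
        # Only text inside a list item is kept.
        if self._in_li:
            self.items[-1][1] += data.strip()


def number_items(html):
    """Return (number, text) tuples for every item of the ordered lists in html."""
    parser = OrderedListNumberer()
    parser.feed(html)
    return [(number, text) for number, text in parser.items]


numbered = number_items('<ol start="3"><li>third</li><li>fourth</li></ol>')
```

With this sketch, a list without a 'start' attribute (or with a non-numeric one) is numbered from 1, which mirrors the "if it exists" wording of the commit message.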
Cesar Berrospi Ramis
1b0ead6907
fix(html): Parse text in div elements as TextItem ( #1041 )
...
feat(html): Parse text in div elements as TextItem
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-24 12:38:29 +01:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON ( #1027 )
...
* test: avoid testing exact JSON
Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Update tests/test_backend_patent_uspto.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2025-02-20 16:20:07 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints ( #999 )
...
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-18 11:30:47 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents ( #967 )
...
* chore(xml-jats): separate authors and affiliations
In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(xml-jats): replace new line character by a space
Instead of removing new line character from text, replace it by a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, better text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-17 10:43:31 +01:00
Tobias Strebitzer
00d9405b0a
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
...
* feat: Implement csv backend and format detection
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* test: Implement csv parsing and format tests
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* docs: Add example and CSV format documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* feat: Add support for various CSV dialects and update documentation
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
* feat: Add validation for delimiters and tests for inconsistent csv files
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
---------
Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com >
2025-02-14 08:55:09 +01:00
Panos Vagenas
90b766e2ae
fix(markdown): handle nested lists ( #910 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-02-07 12:55:12 +01:00
Vladimir Gurevich
722a6eb7b9
fix(msword_backend): handle conversion error in label parsing ( #896 )
...
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.
Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com >
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com >
2025-02-06 12:30:51 +01:00
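As a rough illustration of the idea in #896 above, a conversion helper with a default value might look like the sketch below. Only the name `str_to_int` comes from the commit message; the signature, the default of 0, and the `heading_level` caller are assumptions made for this example and may differ from the actual backend code.

```python
def str_to_int(value, default=0):
    """Return int(value), or `default` when the value cannot be converted.

    Assumed signature for illustration only.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def heading_level(label):
    """Hypothetical caller: read a trailing number from a style label.

    Labels without a usable number fall back to level 1 instead of raising.
    """
    parts = label.split()
    return str_to_int(parts[-1] if parts else None, default=1)


levels = [heading_level(name) for name in ("Heading 2", "Heading", "")]
```

The point of the default is that a label such as "Heading" (no number) no longer aborts the conversion; the caller decides which fallback value makes sense.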
Panos Vagenas
5ac2887e4a
fix(markdown): fix parsing if doc ending with table ( #873 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-02-03 14:38:38 +01:00
Panos Vagenas
94751a78f4
fix(markdown): add support for HTML content ( #855 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-02-03 12:21:05 +01:00
Cesar Berrospi Ramis
0cd81a8122
fix(docx): merged table cells not properly converted ( #857 )
...
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-03 10:20:03 +01:00
Maxim Lysak
eff16b62cc
fix: Processing of placeholder shapes in pptx that have text but no bbox ( #868 )
...
Processing of placeholder shapes in pptx that have text but no bbox
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-02-03 09:33:33 +01:00
Maxim Lysak
d727b04ad0
feat(docx): Support of SDTs in docx backend ( #853 )
...
Support of table of content containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-01-31 14:52:24 +01:00
Maxim Lysak
2c037ae62e
fix: Fixed docx import with headers that are also lists ( #842 )
...
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-31 10:51:21 +01:00
Michele Dolfi
2a1f8afe7e
fix: use new add_code in html backend and add more typing hints ( #850 )
...
fix add_code in html backend and add more typing hints
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-31 09:54:17 +01:00
Panos Vagenas
bccb022fc8
fix(markdown): fix empty block handling ( #843 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-01-30 16:22:29 +01:00
Maxim Lysak
fea0a99a95
fix: Fix for the crash when encountering WMF images in pptx and docx ( #837 )
...
* Fix for the crash when encountering WMF images in pptx and docx backends on non-Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-01-30 14:58:27 +01:00
Panos Vagenas
5aed9f8aeb
fix: fix single newline handling in MD backend ( #824 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag ( #818 )
...
* fix: parse HTML files without body tag
Parse HTML files without a 'body' tag, since it is optional in the HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-27 16:59:00 +01:00
Cesar Berrospi Ramis
c2ae1cc4ca
docs: description of supported formats and backends ( #788 )
...
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-26 08:10:33 +01:00
Peter W. J. Staar
a458e298ca
fix: added extraction of byte-images in excel ( #804 )
...
* fix(msexcel): ignore Mypy checking for _find_images_in_sheet function
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
* fixed some issues
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pinned pillow in pyproject
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local >
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-01-24 18:48:02 +01:00
Panos Vagenas
88a0e66adc
feat: add Docling JSON ingestion ( #783 )
...
* feat: add Docling JSON ingestion
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
* Update docling/backend/json/docling_json_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-24 18:05:23 +01:00