Peter W. J. Staar
e25d557c06
refactor: add the contentlayer to html-backend ( #1040 )
...
* added the contentlayer to html-backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the handle_image function
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code of html backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* test(html): add more info if a test case fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* refactor(html): put parsed item in body if doc has no header
In case an HTML does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth accoring to the changes above.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: set TextItem label to 'text' instead of 'paragraph'
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis
de7b963b09
fix(html): use 'start' attribute when parsing ordered lists from HTML docs ( #1062 )
...
* fix(html): use 'start' attribute in ordered lists
When parsing ordered lists in HTML, take into account the 'start' attribute if it exists.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore(html): reduce verbosity in HTML backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-02-27 09:46:57 +01:00
Suehtam
1d17e7397a
test: avoid testing exact JSON in CSV backend ( #1038 )
...
* feat: updated verify_export
Moved verify_export to verify_utils
Reuse verify_export in tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com >
* feat: replace verify_export with verify_document in CSV conversion tests
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com >
---------
Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com >
2025-02-24 08:10:40 +01:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON ( #1027 )
...
* test: avoid testing exact JSON
Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* Update tests/test_backend_patent_uspto.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
2025-02-20 16:20:07 +01:00
Cesar Berrospi Ramis
a112d7a035
fix: parse html with omitted body tag ( #818 )
...
* fix: parse HTML files without body tag
Parse HTML files without 'body' tag, since it is optional in HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-01-27 16:59:00 +01:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
...
* add real e2e tests for html and docx
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the output of itxt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the text
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the tests (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the examples (1)
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the output of the test
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the tests, moved the ground-truth
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* moved the ground-truth data
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the html tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restructure title fix (#187 )
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-30 13:14:56 +01:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML ( #184 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
2024-10-28 17:02:52 +01:00