Docling/docling
Cesar Berrospi Ramis a112d7a035
fix: parse html with omitted body tag (#818)
* fix: parse HTML files without body tag

Parse HTML files that omit the 'body' tag, since that tag is optional in the HTML5 specification.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* test: ensure docling converts HTML without body tag

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-01-27 16:59:00 +01:00
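
The commit message above says a document may legally omit its 'body' tag, so a parser should not assume a body element is present. The sketch below illustrates that idea only. It is not the code from this commit: it uses Python's standard-library `html.parser`, and the names `TextCollector` and `extract_blocks` are hypothetical, introduced here for illustration. It collects heading, paragraph, and list-item text wherever those tags occur, so the result is the same with or without a 'body' tag.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect block text without requiring <body>."""

    BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # collected (tag, text) pairs, in document order
        self._current = None  # block tag currently being read, if any
        self._buffer = []     # text fragments seen inside the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            # A new block implicitly ends the previous one (e.g. unclosed <p>).
            self._flush()
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        # Text outside a block tag (e.g. inside <title>) is ignored.
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._buffer).split())
        if self._current is not None and text:
            self.blocks.append((self._current, text))
        self._current = None
        self._buffer = []

    def close(self):
        super().close()
        self._flush()  # keep a trailing block that was never closed


def extract_blocks(html):
    """Return (tag, text) pairs for block elements found anywhere in the input."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


with_body = (
    "<!DOCTYPE html><html><head><title>T</title></head>"
    "<body><h1>Title</h1><p>Hello world</p></body></html>"
)
without_body = "<!DOCTYPE html><title>T</title><h1>Title</h1><p>Hello world</p>"

print(extract_blocks(with_body))
print(extract_blocks(without_body))
```

Both calls return the same list because the collector never looks for a body element. A backend built on a tree-based parser would need an equivalent fallback, such as walking from the document root when no body node exists.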
| Name | Last commit | Date |
| --- | --- | --- |
| backend | fix: parse html with omitted body tag (#818) | 2025-01-27 16:59:00 +01:00 |
| chunking | feat: expose new hybrid chunker, update docs (#384) | 2024-12-09 08:28:29 +01:00 |
| cli | feat: add platform info to CLI version printout (#816) | 2025-01-27 16:04:57 +01:00 |
| datamodel | feat(ocr): expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) | 2025-01-27 13:38:15 +01:00 |
| models | feat(ocr): expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) | 2025-01-27 13:38:15 +01:00 |
| pipeline | feat: New document picture classifier (#805) | 2025-01-24 18:05:51 +01:00 |
| utils | feat: Introduce automatic language detection in TesseractOcrCliModel (#800) | 2025-01-26 08:07:56 +01:00 |
| __init__.py | Initial commit | 2024-07-15 09:42:42 +02:00 |
| document_converter.py | feat: add Docling JSON ingestion (#783) | 2025-01-24 18:05:23 +01:00 |
| exceptions.py | fix: improve handling of disallowed formats (#429) | 2024-12-03 12:45:32 +01:00 |
| py.typed | fix: Add py.typed marker file (#531) | 2024-12-06 13:42:14 +01:00 |