Panos Vagenas
|
0533da1923
|
feat: leverage new list modeling, capture default markers (#1856)
* chore: update docling-core & regenerate test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update backends to leverage new list modeling
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* ensure availability of latest docling-core API
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
|
2025-06-27 16:37:15 +02:00 |
|
Panos Vagenas
|
7c5614a37a
|
fix(markdown): fix single-formatted headings & list items (#1820)
* fix(markdown): fix formatting & inline edge cases (show behavior before change)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* add change and updated test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update lock
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* improve test case
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
|
2025-06-25 13:05:06 +02:00 |
|
Cesar Berrospi Ramis
|
d26dac61a8
|
fix(docx): ensure list items have a list parent (#1827)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
|
2025-06-20 14:47:25 +02:00 |
|
Clément Doumouro
|
45265bf8b1
|
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): move bounding bow rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* fix(ocr): avoid to swallow tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
|
2025-05-21 18:12:33 +02:00 |
|
Simon Jégou
|
bfcab3d677
|
feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com>
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
|
2025-04-03 15:11:50 +02:00 |
|