Qiefan Jiang
13865c06f5
perf(msexcel): _find_table_bounds use iter_rows/iter_cols instead of Worksheet.cell ( #1875 )
...
* perf(msexcel): _find_table_bounds use iter_rows/iter_cols instead of sheet.cell
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: 274102a8d4db5d2da8c7ca603e1eb039c1e07967
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
* fix lint
* DCO Remediation Commit for Qiefan Jiang <jiangqiefan@bytedance.com >
I, Qiefan Jiang <jiangqiefan@bytedance.com >, hereby add my Signed-off-by to this commit: b6b5b090a99ba7ba23c1facf0317f7e9f95039e5
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
---------
Signed-off-by: Qiefan Jiang <jiangqiefan@bytedance.com >
2025-07-03 13:12:06 +02:00
William Easton
3089cf2d26
perf: Move expensive imports closer to usage ( #1863 )
...
* Move expensive imports closer to usage
Signed-off-by: William Easton <bill.easton@elastic.co >
* DCO Remediation Commit for William Easton <bill.easton@elastic.co >
I, William Easton <bill.easton@elastic.co >, hereby add my Signed-off-by to this commit: 8a7412ce5bb131a01bb6403067aeb948c9093b0b
Signed-off-by: William Easton <bill.easton@elastic.co >
* formatting fixes
Signed-off-by: William Easton <bill.easton@elastic.co >
* DCO Remediation Commit for William Easton <bill.easton@elastic.co >
I, William Easton <bill.easton@elastic.co >, hereby add my Signed-off-by to this commit: 8a7412ce5bb131a01bb6403067aeb948c9093b0b
I, William Easton <bill.easton@elastic.co >, hereby add my Signed-off-by to this commit: 963e34325071db5e844841f10c27b396a054a0a1
Signed-off-by: William Easton <bill.easton@elastic.co >
* Fix baseocrmodel test issue
Signed-off-by: William Easton <bill.easton@elastic.co >
---------
Signed-off-by: William Easton <bill.easton@elastic.co >
2025-07-01 22:27:17 +02:00
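A minimal sketch of the idea behind this change, not the code from the commit: a module is imported inside the function that needs it, so importing the enclosing package stays cheap. The function name `render_report` is hypothetical, and the standard-library `json` module merely stands in for an expensive dependency.

```python
def render_report(data: dict) -> str:
    # Deferred import: the module is loaded on the first call,
    # not when this file is imported. `json` stands in for a heavy dependency.
    import json

    return json.dumps(data, sort_keys=True)


result = render_report({"b": 1, "a": 2})
```

Python caches modules in `sys.modules`, so repeated calls only pay the import cost once.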
Christoph Auer
56a0e104f7
feat: Integrate ListItemMarkerProcessor into document assembly ( #1825 )
...
* Integrate ListItemMarkerProcessor into document assembly
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update to final version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Update all test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Upgrade deps
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-07-01 10:04:58 +02:00
Christoph Auer
bdfee4e2d0
chore: Safer unloading of DPv4 backend ( #1867 )
...
fix: Safer unloading of DPv4 backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-30 14:41:21 +02:00
Nikos Livathinos
ae39a9411a
fix: Ensure that TesseractOcrModel does not crash in case OSD is not installed ( #1866 )
...
fix: Ensure that TesseractOcrModel does not crash if tesseract OSD is not installed
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com >
2025-06-30 10:55:56 +02:00
Panos Vagenas
0533da1923
feat: leverage new list modeling, capture default markers ( #1856 )
...
* chore: update docling-core & regenerate test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update backends to leverage new list modeling
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* repin docling-core
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* ensure availability of latest docling-core API
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-27 16:37:15 +02:00
Michael Honaker
e79e4f0ab6
fix(markdown): make parsing of rich table cells valid ( #1821 )
...
* fix: update md table classification
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix ground truth header changes
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix merge issues
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
* Fix minor ground truth errors
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
---------
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com >
2025-06-26 19:50:45 +02:00
pranaymiri
d337825b8e
fix: updated granite vision model version for picture description ( #1852 )
...
* updated granite model version
* DCO Remediation Commit for Miriyala Pranay <miriyalapranay146@gmail.com >
I, Miriyala Pranay <miriyalapranay146@gmail.com >, hereby add my Signed-off-by to this commit: 5de0d5034c5988613bc1c42a2dab043ba0106956
Signed-off-by: Miriyala Pranay <miriyalapranay146@gmail.com >
---------
Signed-off-by: Miriyala Pranay <miriyalapranay146@gmail.com >
2025-06-25 17:49:56 +02:00
Panos Vagenas
7c5614a37a
fix(markdown): fix single-formatted headings & list items ( #1820 )
...
* fix(markdown): fix formatting & inline edge cases (show behavior before change)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* add change and updated test data
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* improve test case
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-25 13:05:06 +02:00
Michele Dolfi
41e8cae26b
fix: fix response type of ollama ( #1850 )
...
fix response type of ollama
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-25 11:33:09 +02:00
Allen N.
4002de1f92
fix: Handle missing runs to avoid out of range exception ( #1844 )
...
Fixes #1681 on upstream
Signed-off-by: Allen Nikka <allennikka@gmail.com >
2025-06-25 07:55:27 +02:00
Peter W. J. Staar
1557e7ce3e
feat: Support audio input ( #1763 )
...
* scaffolding in place
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* doing scaffolding for audio pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* WIP: got first transcription working
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, time to start cleaning up
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* first working ASR pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added openai-whisper as a first transcription model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updating with asr_options
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalised the first working ASR pipeline with Whisper
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* use whisper from the latest git commit
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* Update docling/datamodel/pipeline_options.py
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
* updated comment
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* AudioBackend -> DummyBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* file rename
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Rename to NoOpBackend, add test for ASR pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Support every format in NoOpBackend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add missing audio file and test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Install ffmpeg system dependency for ASR test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-23 14:47:26 +02:00
Cesar Berrospi Ramis
d26dac61a8
fix(docx): ensure list items have a list parent ( #1827 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-20 14:47:25 +02:00
mkrssg
1350a8d3e5
fix(msword_backend): Identify text in the same line after an image #1425 ( #1610 )
...
* fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
* fix: extraneous empty paragraphs for test files
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
---------
Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com >
Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com >
2025-06-20 10:55:30 +02:00
Christoph Auer
dd7f64ff28
fix: Ensure uninitialized pages are removed before assembling document ( #1812 )
...
Ensure uninitialized pages are removed before assembling document
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-19 07:33:25 +02:00
Panos Vagenas
861abcdcb0
feat(markdown): add formatting & improve inline support ( #1804 )
...
feat(markdown): support formatting & hyperlinks
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-06-18 15:57:57 +02:00
Shkarupa Alex
215b540f6c
feat: Maximum image size for Vlm models ( #1802 )
...
* Image scale moved to base vlm options.
Added max_size image limit (options and vlm models).
* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com >
I, Shkarupa Alex <shkarupa.alex@gmail.com >, hereby add my Signed-off-by to this commit: e93602a0d02fdb6f6dea1f65686cffcc4c616011
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com >
---------
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com >
2025-06-18 12:57:37 +02:00
Mahafuzur Rahman
dbab30e92c
fix: formula conversion with page_range param set ( #1791 )
...
When the page_range param is used for formula conversion,
the system throws a "list index out of range" error.
Included tests to validate that the fix works.
Signed-off-by: Masum <masumsofts@yahoo.com >
2025-06-17 13:58:45 +02:00
Martin Wind
f28d23cf03
fix: pptx line break and space handling ( #1664 )
...
Signed-off-by: Martin Wind <martin.wind@im-c.at >
2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis
b886e4df31
fix(asciidoc): set default size when missing in image directive ( #1769 )
...
The AsciiDoc backend should not create an ImageRef with Size equal to None; it should use default size values instead.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-16 10:38:46 +02:00
Christoph Auer
7d3302cb48
feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it ( #1745 )
...
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Make page.parsed_page the only source of truth for text cells
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Small fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Correctly compute PDF boxes from pymupdf
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Use different OCR engine order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add type hints and fix mypy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* One more test fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Remove with pypdfium2_lock from caller sites
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix typing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-13 19:01:55 +02:00
Bruno Rigal
7a275c7637
fix: Handle NoneType error in MsPowerpointDocumentBackend ( #1747 )
...
fix: NoneType error in pptx backend
Signed-off-by: Bruno Rigal <bruno.rigal@probayes.com >
Co-authored-by: Bruno Rigal <bruno.rigal@probayes.com >
2025-06-10 19:43:20 +02:00
Ayraf
df140227c3
feat: support xlsm files ( #1520 )
...
* code for xlsm support
* updated support for xlsm
* updated code for xlsm support
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update docling_parse_v4_backend.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
updated the tests/test_backend_msexcel_xlsm.py:
have a function starting with test
removed all print statements
** To add an explicit assert {test}=={pred}
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update base_models.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Update document_converter.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* Delete tests/test_backend_msexcel_xlsm.py
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* xlsm file
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
* run tests
* ran tests
* Fix tests, upgrade XLSM example to a valid file
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b
fix: prov for merged-elems ( #1728 )
...
* fix: prov for merged-elems
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* Reset pyproject.toml
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 11:22:42 +02:00
Maras Ioannis
e979750ce9
fix(tesseract): initialize df_osd to avoid uninitialized variable error ( #1718 )
...
* fix: initialize df_osd to avoid uninitialized variable error
Signed-off-by: IoannisMaras <maras2002@gmail.com >
* Fix formatting
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
* Satisfy mypy, regenerate OCR tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: IoannisMaras <maras2002@gmail.com >
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-06-10 10:57:45 +02:00
Michele Dolfi
f7f31137f1
fix: allow custom torch_dtype in vlm models ( #1735 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-10 10:52:15 +02:00
AndrewTsai0406
9dbcb3d7d4
fix: Improve extraction from textboxes in Word docs ( #1701 )
...
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
* fix/docx_text_box_extraction
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
---------
Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local >
2025-06-06 11:37:46 +02:00
Eugene
a2b83fe4ae
fix: Add WEBP to the list of image file extensions ( #1711 )
...
feat: Add WEBP to the list of image file extensions
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-06-05 09:09:27 +02:00
Peter W. J. Staar
cfdf4cea25
feat: new vlm-models support ( #1570 )
...
* feat: adding new vlm-models support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* got microsoft/Phi-4-multimodal-instruct to work
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* working on vlm's
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the VLM part
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* all working, now serious refactoring necessary
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring the download_model
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the formulate_prompt
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* pixtral 12b runs via MLX and native transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the VlmPredictionToken
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* refactoring minimal_vlm_pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the MyPy
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added pipeline_model_specializations file
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* need to get Phi4 working again ...
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* finalising last points for vlms support
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the pipeline for Phi4
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* streamlining all code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixing the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* added the html backend to the VLM pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* fixed the static load_from_doctags
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* restore stable imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use AutoModelForVision2Seq for Pixtral and review example (including rename)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove unused value
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* refactor instances of VLM models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* skip compare example in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use lowercase and uppercase only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename pipeline_vlm_model_spec
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* move more argument to options and simplify model init
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add supported_devices
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove not-needed function
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* exclude minimal_vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* missing file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add message for transformers version
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename to specs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use module import and remove MLX from non-darwin
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove hf_vlm_model and add extra_generation_args
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* use single HF VLM model class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* remove torch type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add docs for vision models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-06-02 17:01:06 +02:00
Cesar Berrospi Ramis
984cb137f6
fix: guess HTML content starting with script tag ( #1673 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis
3942923125
chore: fix or ignore runtime and deprecation warnings ( #1660 )
...
* chore: fix or catch deprecation warnings
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* chore: update poetry lock with latest docling-core
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-28 17:55:31 +02:00
Peter W. J. Staar
b356b33059
feat: Add visualization of bbox on page with html export. ( #1663 )
...
* feat: Add visualization of bbox on page with html export.
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* updated the cli argument to show_layout
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915
fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte ( #1665 )
...
Update document.py
fix: when mime not "application/xml" or "text/plain" raise
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-27 14:06:05 +02:00
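A minimal sketch of the failure mode named in this commit title, not the code from `document.py`: bytes that begin with `0xd0` followed by a non-continuation byte are not valid UTF-8, so an unconditional decode raises `UnicodeDecodeError`. The helper `try_decode_utf8` and the sample byte strings are hypothetical.

```python
from typing import Optional


def try_decode_utf8(content: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Binary content (here: bytes starting with 0xd0) is not text.
        return None


binary_head = b"\xd0\xcf\x11\xe0"
text_head = b"<?xml version='1.0'?>"
```

With this guard, a caller can fall back to treating the stream as binary instead of crashing.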
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() ( #1636 )
...
fix ZeroDivisionError for cell_bbox.area()
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-22 13:43:33 +02:00
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract ( #1167 )
...
* fix(ocr): tesseract support mis-oriented documents
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): update missing test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): rotate image to the natural orientation before layout prediction
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): move bounding box rotation util to orientation.py
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): refactor rotation utilities
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): revert layout updates
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
---------
Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com >
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages ( #1313 )
...
* Establish confidence field, propagate layout confidence through
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add OCR confidence and parse confidence (stub)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add parse quality rules, use 5% percentile for overall and parse scores
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Heuristic updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Fix garbage regex
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Move grade to page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Introduce mean_score and low_score, consistent aggregate computations
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
* Add confidence test
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-21 12:32:49 +02:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions ( #1609 )
...
fix detecting files with uppercase extensions
Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com >
2025-05-20 19:42:37 +02:00
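A minimal sketch of case-insensitive extension handling, assuming the format is looked up by file suffix; it is not the code from the commit. The helper name `normalized_extension` is hypothetical.

```python
from pathlib import Path


def normalized_extension(filename: str) -> str:
    """Return the file extension without the dot, lowercased."""
    return Path(filename).suffix.lower().lstrip(".")


upper = normalized_extension("Report.DOCX")
mixed = normalized_extension("scan.Pdf")
```

Normalizing once at the lookup boundary means the table of supported extensions only needs lowercase entries.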
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage ( #1617 )
...
* fix load_from_doctags usage
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update dependencies
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* fix lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* revert lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
* update lock file
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
---------
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch >
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines ( #1371 )
...
* Fix force_backend_text
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
* empty commit to retrigger CI
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes ( #1549 )
...
get merged_text from boundingbox instead of merging it to prevent overlaps
Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com >
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend ( #1538 )
...
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
* feat: add textbox content extraction in msword_backend
Signed-off-by: Andrew <tsai247365@gmail.com >
---------
Signed-off-by: Andrew <tsai247365@gmail.com >
2025-05-19 15:01:36 +02:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls ( #1548 )
...
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str ( #1553 )
...
fix: ensure orig and text are both strings in TesseractOcrCliModel
Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com >
2025-05-14 15:05:52 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables ( #1551 )
...
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com >
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type ( #1415 )
...
* support image/webp file type
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* docs: add webp image format in supported_formats.md
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
* test: add a test case for `image/webp` file
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* test: update test case of converting `image/webp` file with more ocr engines
Signed-off-by: Elwin <hzywong@gmail.com >
* style: apply styling
Signed-off-by: Elwin <hzywong@gmail.com >
* rename test file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com >
Signed-off-by: Elwin <hzywong@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-14 09:47:28 +02:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix the regular expression that identifies header lines in AsciiDoc so that
it does not match defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils ( #1577 )
...
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-12 10:48:07 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
DavidLee
f1658edbad
fix: mime error in document streams ( #1523 )
...
Update document.py
edit got file mime error
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS ( #1512 )
...
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-02 15:03:29 +02:00
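A minimal sketch of one common way to keep `hashlib` usable for non-cryptographic purposes such as content fingerprints on FIPS-enabled systems; whether the commit uses this exact approach is an assumption. The `usedforsecurity` keyword is available in Python 3.9 and later, and the helper name `content_fingerprint` is hypothetical.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    # usedforsecurity=False declares that the digest is not used in a
    # security context (assumption: the hash only identifies content).
    return hashlib.sha256(data, usedforsecurity=False).hexdigest()


fingerprint = content_fingerprint(b"abc")
```

The flag does not change the digest value; it only tells the underlying crypto library how the hash is being used.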