Docling

Author	SHA1	Message	Date
mkrssg	1350a8d3e5	fix(msword_backend): Identify text in the same line after an image #1425 (#1610 ) * fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * test: added groundtruth test files for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> * fix: extraneous empty paragraphs for test files Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> --------- Signed-off-by: Michael Krissgau <michael.krissgau@ibm.com> Co-authored-by: Michael Krissgau <michael.krissgau@ibm.com>	2025-06-20 10:55:30 +02:00
Christoph Auer	dd7f64ff28	fix: Ensure uninitialized pages are removed before assembling document (#1812 ) Ensure uninitialized pages are removed before assembling document Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-19 07:33:25 +02:00
Panos Vagenas	861abcdcb0	feat(markdown): add formatting & improve inline support (#1804 ) feat(markdown): support formatting & hyperlinks Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-06-18 15:57:57 +02:00
Shkarupa Alex	215b540f6c	feat: Maximum image size for Vlm models (#1802 ) * Image scale moved to base vlm options. Added max_size image limit (options and vlm models). * DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com> I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: e93602a0d02fdb6f6dea1f65686cffcc4c616011 Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com> --------- Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>	2025-06-18 12:57:37 +02:00
Mahafuzur Rahman	dbab30e92c	fix: formula conversion with page_range param set (#1791 ) When page_range param is used for formula conversion, the system throws list index out of range error. Included tests to validate that the fix works. Signed-off-by: Masum <masumsofts@yahoo.com>	2025-06-17 13:58:45 +02:00
Martin Wind	f28d23cf03	fix: pptx line break and space handling (#1664 ) Signed-off-by: Martin Wind <martin.wind@im-c.at>	2025-06-16 10:44:30 +02:00
Cesar Berrospi Ramis	b886e4df31	fix(asciidoc): set default size when missing in image directive (#1769 ) The AsciiDoc backend should not create an ImageRef with Size equal to None, instead use default size values. Refactor static methods as such and add the staticmethod decorator. Extend the regression test for this fix. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-06-16 10:38:46 +02:00
Christoph Auer	7d3302cb48	feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745 ) * Keep page.parsed_page.textline_cells and page.cells in sync, including OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Make page.parsed_page the only source of truth for text cells Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Small fix Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Correctly compute PDF boxes from pymupdf Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use different OCR engine order Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add type hints and fix mypy Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * One more test fix Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove with pypdfium2_lock from caller sites Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix typing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-13 19:01:55 +02:00
Bruno Rigal	7a275c7637	fix: Handle NoneType error in MsPowerpointDocumentBackend (#1747 ) fix:nonetyperror in pptx backend Signed-off-by: Bruno Rigal <bruno.rigal@probayes.com> Co-authored-by: Bruno Rigal <bruno.rigal@probayes.com>	2025-06-10 19:43:20 +02:00
Ayraf	df140227c3	feat: support xlsm files (#1520 ) * code for xlsm support * updated support for xlsm * updated code for xlsm support * Update docling_parse_v4_backend.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update docling_parse_v4_backend.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel_xlsm.py updated the tests/test_backend_msexcel_xlsm.py: have a function starting with test removed all print statements ** To add an explicit assert {test}=={pred} Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update base_models.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update test_backend_msexcel_xlsm.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Update document_converter.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * Delete tests/test_backend_msexcel_xlsm.py Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * xlsm file Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> * run tests * ran tests * Fix tests, upgrade XSLM example to a valid file Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-10 16:55:59 +02:00
Peter W. J. Staar	6613b9e98b	fix: prov for merged-elems (#1728 ) * fix: prov for merged-elems Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Reset pyproject.toml Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-10 11:22:42 +02:00
Maras Ioannis	e979750ce9	fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718 ) * fix: initialize df_osd to avoid uninitialized variable error Signed-off-by: IoannisMaras <maras2002@gmail.com> * Fix formatting Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Satisfy mypy, regenerate OCR tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: IoannisMaras <maras2002@gmail.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-06-10 10:57:45 +02:00
Michele Dolfi	f7f31137f1	fix: allow custom torch_dtype in vlm models (#1735 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-06-10 10:52:15 +02:00
AndrewTsai0406	9dbcb3d7d4	fix: Improve extraction from textboxes in Word docs (#1701 ) * fix/docx_text_box_extraction Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local> * fix/docx_text_box_extraction Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local> --------- Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local> Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local>	2025-06-06 11:37:46 +02:00
Eugene	a2b83fe4ae	fix: Add WEBP to the list of image file extensions (#1711 ) feat: Add WEBP to the list of image file extensions Signed-off-by: Eugene <fogaprod@gmail.com>	2025-06-05 09:09:27 +02:00
Peter W. J. Staar	cfdf4cea25	feat: new vlm-models support (#1570 ) * feat: adding new vlm-models support Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the transformers Signed-off-by: Peter Staar <taa@zurich.ibm.com> * got microsoft/Phi-4-multimodal-instruct to work Signed-off-by: Peter Staar <taa@zurich.ibm.com> * working on vlm's Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactoring the VLM part Signed-off-by: Peter Staar <taa@zurich.ibm.com> * all working, now serious refacgtoring necessary Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactoring the download_model Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the formulate_prompt Signed-off-by: Peter Staar <taa@zurich.ibm.com> * pixtral 12b runs via MLX and native transformers Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the VlmPredictionToken Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactoring minimal_vlm_pipeline Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the MyPy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added pipeline_model_specializations file Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to get Phi4 working again ... Signed-off-by: Peter Staar <taa@zurich.ibm.com> * finalising last points for vlms support Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the pipeline for Phi4 Signed-off-by: Peter Staar <taa@zurich.ibm.com> * streamlining all code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixing the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the html backend to the VLM pipeline Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the static load_from_doctags Signed-off-by: Peter Staar <taa@zurich.ibm.com> * restore stable imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use AutoModelForVision2Seq for Pixtral and review example (including rename) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove unused value Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * refactor instances of VLM models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip compare example in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use lowercase and uppercase only Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename pipeline_vlm_model_spec Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move more argument to options and simplify model init Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add supported_devices Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove not-needed function Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * exclude minimal_vlm Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * missing file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add message for transformers version Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename to specs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use module import and remove MLX from non-darwin Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove hf_vlm_model and add extra_generation_args Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use single HF VLM model class Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove torch type Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add docs for vision models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-06-02 17:01:06 +02:00
Cesar Berrospi Ramis	984cb137f6	fix: guess HTML content starting with script tag (#1673 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis	3942923125	chore: fix or ignore runtime and deprecation warnings (#1660 ) * chore: fix or catch deprecation warnings Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: update poetry lock with latest docling-core Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-05-28 17:55:31 +02:00
Peter W. J. Staar	b356b33059	feat: Add visualization of bbox on page with html export. (#1663 ) * feat: Add visualization of bbox on page with html export. Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the cli argument to show_layout Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-05-28 13:10:38 +02:00
DavidLee	51d3450915	fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665 ) Update document.py fix: when mime not "application/xml" or "text/plain" raise UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte Signed-off-by: DavidLee <yongsheng_li@foxmail.com>	2025-05-27 14:06:05 +02:00
Said Gürbüz	c2f595d283	fix: fix ZeroDivisionError for cell_bbox.area() (#1636 ) fix ZeroDivisionError for cell_bbox.area() Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>	2025-05-22 13:43:33 +02:00
Clément Doumouro	45265bf8b1	feat(ocr): auto-detect rotated pages in Tesseract (#1167 ) * fix(ocr): tesseract support mis-oriented documents Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): update missing test data Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): rotate image to the natural orientation before layout prediction Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): move bounding bow rotation util to orientation.py Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): refactor rotation utilities Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): revert layout updates Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): update e2e OCR test data Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): avoid to swallow tesseract errors causing orientation detection failures Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): revert layout updates Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): update e2e OCR test data * chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel` * chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel` * chore(ocr): default `TesseractOcrCliModel._is_auto` to `False` * fix(ocr): fix `TesseractOcrCliModel._is_auto` computation * chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel` --------- Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>	2025-05-21 18:12:33 +02:00
Christoph Auer	90875247e5	feat: Establish confidence estimation for document and pages (#1313 ) * Establish confidence field, propagate layout confidence through Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add OCR confidence and parse confidence (stub) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add parse quality rules, use 5% percentile for overall and parse scores Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Heuristic updates Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix garbage regex Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move grade to page Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Introduce mean_score and low_score, consistent aggregate computations Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add confidence test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-05-21 12:32:49 +02:00
MoheyElDin Badr	f4d9d4111b	fix: Fix issue with detecting docx files, and files with upper case extensions (#1609 ) fix detecting files with uppercase extensions Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>	2025-05-20 19:42:37 +02:00
Said Gürbüz	0e00a263fa	fix: load_from_doctags static usage (#1617 ) * fix load_from_doctags usage Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch> * update dependencies Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch> * fix lock file Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch> * revert lock file Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch> * update lock file Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch> --------- Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>	2025-05-20 15:06:12 +02:00
Krishnan	f2e9c0784c	fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371 ) * Fix force_backend_text Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local> * empty commit to retrigger CI Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>	2025-05-20 09:59:38 +02:00
Pedro Ribeiro	98b5eeb844	fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549 ) get merged_text from boundingbox instead of merging it to prevent overlaps Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>	2025-05-19 15:26:00 +02:00
AndrewTsai0406	12a0e64892	feat: add textbox content extraction in msword_backend (#1538 ) * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> * feat: add textbox content extraction in msword_backend Signed-off-by: Andrew <tsai247365@gmail.com> --------- Signed-off-by: Andrew <tsai247365@gmail.com>	2025-05-19 15:01:36 +02:00
Vinay R Damodaran	3a04f2a367	feat: Improve parallelization for remote services API calls (#1548 ) * Provide the option to make remote services call concurrent Signed-off-by: Vinay Damodaran <vrdn@hey.com> * Use yield from correctly? Signed-off-by: Vinay Damodaran <vrdn@hey.com> * not do amateur hour stuff Signed-off-by: Vinay Damodaran <vrdn@hey.com> --------- Signed-off-by: Vinay Damodaran <vrdn@hey.com>	2025-05-14 15:47:55 +02:00
jimkarag02	9f8b479f17	fix(ocr): orig field in TesseractOcrCliModel as str (#1553 ) fix: ensure orig and text are both strings in TesseractOcrCliModel Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com>	2025-05-14 15:05:52 +02:00
Alex Sokolov	2efb7a7c06	fix(settings): fix nested settings load via environment variables (#1551 ) Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com>	2025-05-14 13:42:10 +02:00
Elwin	12dab0a1e8	feat: support image/webp file type (#1415 ) * support image/webp file type Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com> Signed-off-by: Elwin <hzywong@gmail.com> * docs: add webp image format in supported_formats.md Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com> Signed-off-by: Elwin <hzywong@gmail.com> * test: add a test case for `image/webp` file Signed-off-by: Elwin <hzywong@gmail.com> * style: apply styling Signed-off-by: Elwin <hzywong@gmail.com> * test: update test case of converting `image/webp` file with more ocr engines Signed-off-by: Elwin <hzywong@gmail.com> * style: apply styling Signed-off-by: Elwin <hzywong@gmail.com> * rename test file Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com> Signed-off-by: Elwin <hzywong@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-05-14 09:47:28 +02:00
Marco Fargetta	4046d0b2f3	fix: AsciiDoc header identification (#1562 ) (#1563 ) Fix regular expression to identify header lines in AsciiDoc avoiding to match defined blocks. Signed-off-by: Marco Fargetta <mfargett@redhat.com>	2025-05-13 11:17:26 +02:00
Michele Dolfi	127e38646f	fix: add smoldocling in download utils (#1577 ) add smoldocling in download utils Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-05-12 10:48:07 +02:00
Cesar Berrospi Ramis	776e7ecf9a	fix(HTML): handle row spans in header rows (#1536 ) * chore(HTML): log the stacktrace of errors Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * fix(HTML): handle row headers like in pivot tables Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-05-09 15:14:32 +02:00
DavidLee	f1658edbad	fix: mime error in document streams (#1523 ) Update document.py edit got file mime error Signed-off-by: DavidLee <yongsheng_li@foxmail.com>	2025-05-06 09:30:46 +02:00
Michele Dolfi	7c705739f9	fix: usage of hashlib for FIPS (#1512 ) fix usage of hashlib for FIPS Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-05-02 15:03:29 +02:00
Ihar Hrachyshka	b147331f2a	chore: restore typing hint for self.script_readers (#1500 ) With future annotations, typing hints resolution is always deferred. https://peps.python.org/pep-0563/ Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-04-30 20:33:27 +02:00
Ben Browning	4ab7e9ddfb	fix: Guard against attribute errors in TesseractOcrModel __del__ (#1494 ) This moves the initialization of the `reader` and `script_readers` attributes to before we attempt to import tesserocr, so that when later accessing these attributes in the garbage collection method `__del__` the attributes exist. This requires changing the typing of the `script_readers` dict value to `Any` because we cannot yet reference its actual strong type, since it's a tesserocr value. This prevents throwing an exception during garbage collection for cases where the TesseractOcrModel instance didn't properly initialize, like when it throws an `ImportError` during its initializer. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-30 17:51:33 +02:00
Zach Cox	cc453961a9	fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496 ) fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel Signed-off-by: Zach Cox <zach.s.cox@gmail.com>	2025-04-30 08:02:52 +02:00
Peter W. J. Staar	976e92e289	fix: updated the time-recorder label for reading order (#1490 ) * fix: updated the time-recorder label for reading order Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2025-04-29 13:02:53 +02:00
nkh0472	a097ccd8d5	chore: typo fix (#1465 ) * typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> --------- Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>	2025-04-28 08:52:09 +02:00
Maxim Lysak	94d66a0765	fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459 ) fixing double scaling in case of do_cell_matching is False Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-04-25 12:34:12 +02:00
Cesar Berrospi Ramis	ed20124544	fix(html): handle address, details, and summary tags (#1436 ) * fix(html): handle 'address' tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * fix(html): handle 'details' tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-23 09:30:59 +02:00
Eugene	8012a3e4d6	fix: Treat overflowing -v flags as DEBUG (#1419 ) Signed-off-by: Eugene <fogaprod@gmail.com>	2025-04-19 11:02:41 +02:00
Michele Dolfi	5458a88464	ci: add coverage and ruff (#1383 ) * add coverage calculation and push Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * new codecov version and usage of token Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * enable ruff formatter instead of black and isort Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff lint fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff unsafe fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add removed imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * runs 1 on linter issues Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * finalize linter fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update pyproject.toml Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-14 18:01:26 +02:00
Peter W. J. Staar	c0ba88edf1	feat(cli): add option for html with split-page mode (#1355 ) * updated the cli to output html in split-page mode Signed-off-by: Peter Staar <taa@zurich.ibm.com> * add pin for new docling-core with html split argument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * relock with fixed html export in docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update lock with docling-core fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add again chunking extras Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-14 08:41:50 +02:00
Tim Kellogg	0de70e7991	fix: auto-recognize .xlsx, .docx and .pptx files (#1340 ) * bug: auto-recognize .xlsx files Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com> * apply styling Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply to other ms office zip formats Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-14 07:45:13 +02:00
Cesar Berrospi Ramis	415b877984	fix(docx): declare image_data variable when handling pictures (#1359 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-11 13:04:00 +02:00
Rowan Skewes	250399948d	fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248 ) fix: Implement PictureDescriptionApiOptions.picture_area_threshold Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com>	2025-04-11 11:14:05 +02:00


249 Commits