Docling

Author	SHA1	Message	Date
nkh0472	a097ccd8d5	chore: typo fix (#1465 ) * typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> * chore: typo fix Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com> --------- Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com>	2025-04-28 08:52:09 +02:00
Maxim Lysak	94d66a0765	fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459 ) fixing double scaling in case of do_cell_matching is False Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-04-25 12:34:12 +02:00
Cesar Berrospi Ramis	ed20124544	fix(html): handle address, details, and summary tags (#1436 ) * fix(html): handle 'address' tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * fix(html): handle 'details' tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-23 09:30:59 +02:00
Eugene	8012a3e4d6	fix: Treat overflowing -v flags as DEBUG (#1419 ) Signed-off-by: Eugene <fogaprod@gmail.com>	2025-04-19 11:02:41 +02:00
Michele Dolfi	5458a88464	ci: add coverage and ruff (#1383 ) * add coverage calculation and push Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * new codecov version and usage of token Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * enable ruff formatter instead of black and isort Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff lint fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff unsafe fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add removed imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * runs 1 on linter issues Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * finalize linter fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update pyproject.toml Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-14 18:01:26 +02:00
Peter W. J. Staar	c0ba88edf1	feat(cli): add option for html with split-page mode (#1355 ) * updated the cli to output html in split-page mode Signed-off-by: Peter Staar <taa@zurich.ibm.com> * add pin for new docling-core with html split argument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * relock with fixed html export in docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update lock with docling-core fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update test results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add again chunking extras Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-14 08:41:50 +02:00
Tim Kellogg	0de70e7991	fix: auto-recognize .xlsx, .docx and .pptx files (#1340 ) * bug: auto-recognize .xlsx files Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com> * apply styling Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply to other ms office zip formats Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-14 07:45:13 +02:00
Cesar Berrospi Ramis	415b877984	fix(docx): declare image_data variable when handling pictures (#1359 ) Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-11 13:04:00 +02:00
Rowan Skewes	250399948d	fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248 ) fix: Implement PictureDescriptionApiOptions.picture_area_threshold Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com>	2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis	eef2bdea77	feat(xlsx): create a page for each worksheet in XLSX backend (#1332 ) * style(xlsx): enforce type hints in XLSX backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * feat(xlsx): create a page for each worksheet in XLSX backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs(xlsx): add docstrings to XLSX backend module. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docling(xlsx): add bounding boxes and page size information in cell units Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-04-11 10:29:53 +02:00
Gabe Goodhart	c605edd8e9	feat: OllamaVlmModel for Granite Vision 3.2 (#1337 ) * build: Add ollama sdk dependency Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add option plumbing for OllamaVlmOptions in pipeline_options Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full implementation of OllamaVlmModel Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Connect "granite_vision_ollama" pipeline option to CLI Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Revert "build: Add ollama sdk dependency" After consideration, we're going to use the generic OpenAI API instead of the Ollama-specific API to avoid duplicate work. This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move OpenAI API call logic into utils.utils This will allow reuse of this logic in a generic VLM model NOTE: There is a subtle change here in the ordering of the text prompt and the image in the call to the OpenAI API. When run against Ollama, this ordering makes a big difference. If the prompt comes before the image, the result is terse and not usable whereas the prompt coming after the image works as expected and matches the non-OpenAI chat API. Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor from Ollama SDK to generic OpenAI API Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Linting, formatting, and bug fixes The one bug fix was in the timeout arg to openai_image_request. Otherwise, this is all style changes to get MyPy and black passing cleanly. Branch: OllamaVlmModel Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * remove model from download enum Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * generalize input args for other API providers Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename and refactor Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * require flag for remote services Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * disable example from CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add examples to docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-04-10 18:03:04 +02:00
Joan Fabrégat	6b696b504a	fix: Properly address page in pipeline _assemble_document when page_range is provided (#1334 ) * Fixes #1333 Signed-off-by: Joan Fabrégat <j@fabreg.at> * fix for the (dumb) MyPy type checker Signed-off-by: Joan Fabrégat <j@fabreg.at> --------- Signed-off-by: Joan Fabrégat <j@fabreg.at>	2025-04-10 16:11:28 +02:00
Maxim Lysak	355d8dc7a6	chore: Logo parameter in docling CLI, prints cute ascii logo (#1294 ) logo parameter in docling cli, prints cute ascii logo Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima	14e9c0ce9a	fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295 ) * Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Identify headers through inhenrited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Identify headers through inhenrited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix(docx): Improve text parsing (#1268) * chore: bump version to 2.28.4 [skip ci] Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Improve text parsing Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Flexibilize heading detection Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Fix trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Remove trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima 
<Rafael.td.lima@gmail.com> Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * docs: add visual grounding example (#1270) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * feat(docx): add text formatting and hyperlink support (#630) * feat: Enable markdown text formatting for docx Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix imports Signed-off-by: SimJeg <sjegou@nvidia.com> * Use Formatting Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle hyperlink Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle formatting properly for DocItemLabel.PARAGRAPH Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline group Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle bullet lists Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Run black and mypy Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle header and footer Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline_fmt everywhere Signed-off-by: SimJeg <sjegou@nvidia.com> * Run precommit Signed-off-by: SimJeg <sjegou@nvidia.com> * Address feedback Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix add_list_item Signed-off-by: SimJeg <sjegou@nvidia.com> * fix minor bugs, mark helper methods internal Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix(pptx): check if picture shape has an image attached (#1316) Check if picture shape has an image attached 
in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * chore: update lock file (#1315) chore: update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * docs: add plugins docs (#1319) add plugin docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * feat: handle <code> tags as code blocks (#1320) handle <code> tags as code blocks Signed-off-by: FernandoSSI <fernandosi2005@gmail.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Identify headers through inhenrited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: FernandoSSI <fernandosi2005@gmail.com> Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: 
Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com>	2025-04-08 17:11:37 +02:00
Fernando Santos	0499cd1c1e	feat: handle <code> tags as code blocks (#1320 ) handle <code> tags as code blocks Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>	2025-04-08 10:32:06 +02:00
Maxim Lysak	dc3bf9ceac	fix(pptx): check if picture shape has an image attached (#1316 ) Check if picture shape has an image attached in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-04-07 17:36:56 +02:00
Simon Jégou	bfcab3d677	feat(docx): add text formatting and hyperlink support (#630 ) * feat: Enable markdown text formatting for docx Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix imports Signed-off-by: SimJeg <sjegou@nvidia.com> * Use Formatting Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle hyperlink Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle formatting properly for DocItemLabel.PARAGRAPH Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline group Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle bullet lists Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com> * Run black and mypy Signed-off-by: SimJeg <sjegou@nvidia.com> * Handle header and footer Signed-off-by: SimJeg <sjegou@nvidia.com> * Use inline_fmt everywhere Signed-off-by: SimJeg <sjegou@nvidia.com> * Run precommit Signed-off-by: SimJeg <sjegou@nvidia.com> * Address feedback Signed-off-by: SimJeg <sjegou@nvidia.com> * Fix add_list_item Signed-off-by: SimJeg <sjegou@nvidia.com> * fix minor bugs, mark helper methods internal Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>	2025-04-03 15:11:50 +02:00
Rafael Teixeira de Lima	d2d68747f9	fix(docx): Improve text parsing (#1268 ) * chore: bump version to 2.28.4 [skip ci] Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Improve text parsing Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Flexibilize heading detection Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Fix trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Remove trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>	2025-04-02 12:56:44 +02:00
Guilhem VERMOREL	b3d111a3cd	fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 ) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>	2025-03-31 10:53:49 +02:00
Maxim Lysak	7afad7e52d	fix: Fixes tables when using OCR (#1261 ) Fix for the tables when using OCR Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-03-29 10:06:00 +01:00
Maxim Lysak	8bd71e8e33	fix: Word-level pdf cells for tables (#1238 ) * word-level pdf cells for tables Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed comments Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated dependency to docling-core Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-03-28 16:34:48 +01:00
Panos Vagenas	9210812bfa	fix: improve HTML layer detection, various MD fixes (#1241 ) Markdown fixes: - properly propagate section header levels - improve handling of list subroots without text Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-03-26 16:07:14 +01:00
Panos Vagenas	85c4df887b	fix(html): fix HTML parsed heading level (#1244 ) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-03-26 10:30:23 +01:00
mislavmartinic	825b226fab	fix(converter): Cache same pipeline class with different options (#1152 ) * Update document_converter.py Fixing caching same class with different options by using composite key (class, options) # TODO this will ignore if different options have been defined for the same pipeline class. at row 292 Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com> * formatted script * removed unnecessary hasattr check * pre-commit chain run --------- Signed-off-by: mislavmartinic <mislav.martinic@pontistechnology.com>	2025-03-25 12:18:44 +01:00
Hoang-Long Do	6df8827231	fix(debug): Missing translation of bbox to to_bounding_box (#1220 ) * Fix: Add missing bbox attribute to PdfTextCell Signed-off-by: hl2311 <dhlong2301@gmail.com> * fix: Refactor missing bbox attribute to PdfTextCell Signed-off-by: hl2311 <dhlong2301@gmail.com> --------- Signed-off-by: hl2311 <dhlong2301@gmail.com>	2025-03-25 12:18:10 +01:00
Rafael Teixeira de Lima	f739d0e4c5	fix(docx): identifying numbered headers (#1231 ) * Modifications to identify numbered headers Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Add style check Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>	2025-03-25 11:41:02 +01:00
Maxim Lysak	1c26769785	feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199 ) * Initial implementation to support MLX for VLM pipeline and SmolDocling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * mlx_model unit Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Add CLI choices for VLM pipeline and model Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated minimal vlm pipeline example Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * make vlm_pipeline python3.9 compatible Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixed extract_text_from_backend definition Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated README Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated example Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated documentation Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * corrections in the documentation Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cosmetic changes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-03-19 15:38:54 +01:00
Maciej Wieczorek	b454aa1551	feat: Add PPTX notes slides (#474 ) * feat: Add PPTX notes slides Presenter notes may have useful information and should also be extracted. Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co> * feat: Move presenter notes into furniture Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co> --------- Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>	2025-03-19 14:52:09 +01:00
Christoph Auer	f5adfb9724	fix: Determine correct page size in DoclingParseV4Backend (#1196 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-03-19 11:05:42 +01:00
Rafael Teixeira de Lima	0b707d0882	fix(msword): Fixing function return in equations handling (#1194 ) * Fixing function return Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * Add message Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>	2025-03-19 10:34:25 +01:00
Maxim Lysak	2f72167ff6	feat: updated vlm pipeline (with latest changes from docling-core) (#1158 ) * Draft implementation of Doctag backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated VLM pipeline doctags to docling conversion, now properly supports lists Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * preparing to migrate to new doctags deserializer Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * re-using DocTagsDocument.from_doctags_and_image_pairs Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * satisfying mypy and other checks Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added support for force_backend_text parameter Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed unnecessary transformation Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cleaned up Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Update tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated readme Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2025-03-18 15:44:51 +01:00
Michele Dolfi	6eaae3cba0	feat: add factory for ocr engines via plugins (#1010 ) * add factory for ocr engines Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply pre-commit after rebase Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add picture description factory Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix enable option Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * switch to create methods Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * make `options` an explicit kwarg Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * keep old lock of docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add allow_external_plugins option Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add factory return and ignore options type Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>	2025-03-18 13:58:05 +01:00
Christoph Auer	3960b199d6	feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905 ) * Add DoclingParseV3 backend implementation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Use docling-core with docling-parse types Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes and test updates Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix streams Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix streams Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Reset tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * update test cases Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * update test units Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add back DoclingParse v1 backend, pipeline options Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update locks Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: update docling-core to 2.22.0 Update dependency library docling-core to latest release 2.22.0 Fix regression tests and ground truth files Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * Ground-truth files updated Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update tests, use TextCell.from_ocr property Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Text fixes, new test data Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Rename docling backend to v4 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Test all backends, fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Reset all tests to use docling-parse v1 for now Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for DPv4 backend init, better test coverage Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * test_input_doc use default backend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-18 10:38:19 +01:00
Michele Dolfi	fa16b12316	chore: move to docling-project org (#1160 ) * chore: rename org Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update docs/faq/index.md Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * update github pages Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * revert test content Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-03-14 12:35:29 +01:00
Cesar Berrospi Ramis	f94da44ec5	fix(html): handle nested empty lists (#1154 ) Address the case of nested lists in empty list items. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-13 16:56:58 +01:00
Panos Vagenas	0945973b79	fix: use first table row as col headers (#1156 ) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima	6eb718f849	feat: equations to latex in MSWord backend (with inline groups) (#1114 ) * Equation groups Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix: Proper handling of orphan IDs in layout postprocessing (#1118) * Fix the handling of orphan IDs in layout postprocessing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test cases Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * chore: bump version to 2.25.2 [skip ci] * docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124) add env var in docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * fix(CLI): fix help message for abort options (#1130) fix help message Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * perf: New revision code formula model and document picture classifier (#1140) * new version code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new version document picture classifier Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * restored original code formula test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> --------- Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * feat: Use new TableFormer model weights and default to accurate model version (#1100) * feat: New tableformer model weights [WIP] Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Updated TF version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests, after merging with Main, Switched to Accurate TF model by 
default Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> * chore: bump version to 2.26.0 [skip ci] * fix: Pass tests, update docling-core to 2.22.0 (#1150) fix: update docling-core to 2.22.0 Update dependency library docling-core to latest release 2.22.0 Fix regression tests and ground truth files Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * Updating content hash Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> --------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis	aa92a57fa9	fix: Pass tests, update docling-core to 2.22.0 (#1150 ) fix: update docling-core to 2.22.0 Update dependency library docling-core to latest release 2.22.0 Fix regression tests and ground truth files Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-13 09:45:55 +01:00
Christoph Auer	eb97357b05	feat: Use new TableFormer model weights and default to accurate model version (#1100 ) * feat: New tableformer model weights [WIP] Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Updated TF version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests, after merging with Main, Switched to Accurate TF model by default Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-03-11 10:53:49 +01:00
Matteo	5e30381c0d	perf: New revision code formula model and document picture classifier (#1140 ) * new version code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new version document picture classifier Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * new code formula model Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * restored original code formula test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> --------- Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>	2025-03-11 10:15:28 +01:00
Michele Dolfi	4d64c4c0b6	fix(CLI): fix help message for abort options (#1130 ) fix help message Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-03-07 14:47:49 +01:00
Christoph Auer	c56ab3a66b	fix: Proper handling of orphan IDs in layout postprocessing (#1118 ) * Fix the handling of orphan IDs in layout postprocessing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test cases Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-03-05 14:30:59 +01:00
Michele Dolfi	8dc0562542	fix: enable locks for threadsafe pdfium (#1052 ) * enable locks for threadsafe pdfium Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix deadlock in pypdfium2 backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-03-02 20:06:44 +01:00
Peter W. J. Staar	e25d557c06	refactor: add the contentlayer to html-backend (#1040 ) * added the contentlayer to html-backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the handle_image function Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code of html backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * test(html): add more info if a test case fails Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor(html): put parsed item in body if doc has no header In case an HTML does not have any header tag, all parsed items are placed in DoclingDocument's body content layer. HTML paragraphs ('p' tags) are parsed as text items with paragraph label. Update test ground truth according to the changes above. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: set TextItem label to 'text' instead of 'paragraph' Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis	de7b963b09	fix(html): use 'start' attribute when parsing ordered lists from HTML docs (#1062 ) * fix(html): use 'start' attribute in ordered lists When parsing ordered lists in HTML, take into account the 'start' attribute if it exists. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore(html): reduce verbosity in HTML backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-27 09:46:57 +01:00
Christoph Auer	3c9fe76b70	feat: [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054 ) * Skeleton for SmolDocling model and VLM Pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * wip smolDocling inference and vlm pipeline Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * WIP, first working code for inference of SmolDocling, and vlm pipeline assembly code, example included. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixes to preserve page image and demo export to html Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Enabled figure support in vlm_pipeline Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fix for table span compute in vlm_pipeline Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Properly propagating image data per page, together with predicted tags in VLM pipeline. This enables correct figure extraction and page numbers in provenances Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cleaned up logs, added pages to vlm_pipeline, basic timing per page measurement in smol_docling models Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Replaced hardcoded otsl tokens with the ones from docling-core tokens.py enum Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tokens/sec measurement, improved example Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added capability for vlm_pipeline to grab text from preconfigured backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Exposed "force_backend_text" as pipeline parameter Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Flipped keep_backend to True for vlm_pipeline assembly to work Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated vlm pipeline assembly and smol docling model code to support updated doctags Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixing doctags starting tag, that broke elements on first line during assembly Signed-off-by: Maksym 
Lysak <mly@zurich.ibm.com> * Introduced SmolDoclingOptions to configure model parameters (such as query and artifacts path) via client code, see example in minimal_smol_docling. Provisioning for other potential vlm all-in-one models. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Moved artifacts_path for SmolDocling into vlm_options instead of global pipeline option Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * New assembly code for latest model revision, updated prompt and parsing of doctags, updated logging Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated example of Smol Docling usage Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added captions for the images for SmolDocling assembly code, improved provenance definition for all elements Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Update minimal smoldocling example Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix repo id Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleaned up unnecessary logging Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * More elegant solution in removing the input prompt Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed minimal_smol_docling example from CI checks Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Removed special html code wrapping when exporting to docling document, cleaned up comments Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Addressing PR comments, added enabled property to SmolDocling, and related VLM pipeline option, few other minor things Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Moved keep_backend = True to vlm pipeline Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed pipeline_options.generate_table_images from vlm_pipeline (deprecated in the pipelines) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example on how to get original predicted doctags in minimal_smol_docling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removing changes from base_pipeline Signed-off-by: 
Maksym Lysak <mly@zurich.ibm.com> * Replaced remaining strings to appropriate enums Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated poetry.lock Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * re-built poetry.lock Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Generalize and refactor VLM pipeline and models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Rename example Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Expose control over using flash_attention_2 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix VLM example exclusion in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add back device_map and accelerate Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Make drawing code resilient against bad bboxes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: clean up code and comments Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: more cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: fix leftover .to(device) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: add proper table provenance Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-02-26 14:43:26 +01:00
Panos Vagenas	ab683e4fb6	feat(cli): add option for downloading all models, refine help messages (#1061 ) * chore(cli): update download help messages Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * add `--all` flag to model download CLI Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>	2025-02-26 13:27:29 +01:00
Michele Dolfi	e197225739	fix: vlm using artifacts path (#1057 ) * fix usage of artifacts path Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add granite vision to the download utils Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-02-26 08:33:50 +01:00
Cesar Berrospi Ramis	1b0ead6907	fix(html): Parse text in div elements as TextItem (#1041 ) feat(html): Parse text in div elements as TextItem Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-24 12:38:29 +01:00
Christoph Auer	c93e36988f	feat: Implement new reading-order model (#916 ) * Implement new reading-order model, replacing DS GLM model (WIP) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update reading-order model branch Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile [skip ci] Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add captions, footnotes and merges [skip ci] Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updates for reading-order implementation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updates for reading-order implementation Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update tests and lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes, update tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add normalization, update tests again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update tests with code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Push final lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * sanitize text Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Include furniture, Update tests with furniture Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix content_layer assignment Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: Delete empty file docling/models/ds_glm_model.py Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>	2025-02-20 17:51:17 +01:00


208 Commits