Docling

Author	SHA1	Message	Date
Michele Dolfi	8dc0562542	fix: enable locks for threadsafe pdfium (#1052 ) * enable locks for threadsafe pdfium Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix deadlock in pypdfium2 backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-03-02 20:06:44 +01:00
Peter W. J. Staar	e25d557c06	refactor: add the contentlayer to html-backend (#1040 ) * added the contentlayer to html-backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the handle_image function Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code of html backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * test(html): add more info if a test case fails Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor(html): put parsed item in body if doc has no header In case an HTML does not have any header tag, all parsed items are placed in DoclingDocument's body content layer. HTML paragraphs ('p' tags) are parsed as text items with paragraph label. Update test ground truth accoring to the changes above. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: set TextItem label to 'text' instead of 'paragraph' Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-03-02 10:37:53 -05:00
Cesar Berrospi Ramis	de7b963b09	fix(html): use 'start' attribute when parsing ordered lists from HTML docs (#1062 ) * fix(html): use 'start' attribute in ordered lists When parsing ordered lists in HTML, take into account the 'start' attribute if it exists. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore(html): reduce verbosity in HTML backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-27 09:46:57 +01:00
Cesar Berrospi Ramis	1b0ead6907	fix(html): Parse text in div elements as TextItem (#1041 ) feat(html): Parse text in div elements as TextItem Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-24 12:38:29 +01:00
Cesar Berrospi Ramis	1ac010354f	test: avoid testing exact JSON (#1027 ) * test: avoid testing exact JSON Avoid testing exact JSON output in html and xml backends. Reuse the JSON verify helper function among backend test files. Improve type annotations in html backend. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * Update tests/test_backend_patent_uspto.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-02-20 16:20:07 +01:00
Cesar Berrospi Ramis	7450050ace	refactor: upgrade BeautifulSoup4 with type hints (#999 ) * refactor: upgrade BeautifulSoup4 with type hints Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints). Refactor backends using BeautifulSoup4 to comply with type hints. Apply style simplifications and improvements for consistency. Remove variables and functions that are never used. Remove code duplication between backends for parsing HTML tables. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * build: allow beautifulsoup4 version 4.12.3 Allow older version of beautifulsoup4 and ensure compatibility. Update library dependencies. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-18 11:30:47 +01:00
Cesar Berrospi Ramis	428b656793	feat(xml-jats): parse XML JATS documents (#967 ) * chore(xml-jats): separate authors and affiliations In XML PubMed (JATS) backend, convert authors and affiliations as they are typically rendered on PDFs. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * fix(xml-jats): replace new line character by a space Instead of removing new line character from text, replace it by a space character. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * feat(xml-jats): improve existing parser and extend features Partially support lists, respect reading order, parse more sections, support equations, better text formatting. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore(xml-jats): rename PubMed objects to JATS Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-17 10:43:31 +01:00
Tobias Strebitzer	00d9405b0a	feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945 ) * feat: Implement csv backend and format detection Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * test: Implement csv parsing and format tests Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * docs: Add example and CSV format documentation Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * feat: Add support for various CSV dialects and update documentation Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> * feat: Add validation for delimiters and tests for inconsistent csv files Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com> --------- Signed-off-by: Tobias Strebitzer <tobias.strebitzer@magloft.com>	2025-02-14 08:55:09 +01:00
Panos Vagenas	90b766e2ae	fix(markdown): handle nested lists (#910 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-07 12:55:12 +01:00
Vladimir Gurevich	722a6eb7b9	fix(msword_backend): handle conversion error in label parsing (#896 ) Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors. Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com> Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>	2025-02-06 12:30:51 +01:00
Panos Vagenas	5ac2887e4a	fix(markdown): fix parsing if doc ending with table (#873 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-03 14:38:38 +01:00
Panos Vagenas	94751a78f4	fix(markdown): add support for HTML content (#855 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-02-03 12:21:05 +01:00
Cesar Berrospi Ramis	0cd81a8122	fix(docx): merged table cells not properly converted (#857 ) * fix(docx): merged cells not properly converted Fix conversion issue of merged cells in Word tables leading to repeated text. Simplify Word table conversion code. Add docx file with several table formats for regression tests. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add type hinting to docx backend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-02-03 10:20:03 +01:00
Maxim Lysak	eff16b62cc	fix: Processing of placeholder shapes in pptx that have text but no bbox (#868 ) Processing of placeholder shapes in pptx that have text but no bbox Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-02-03 09:33:33 +01:00
Maxim Lysak	d727b04ad0	feat(docx): Support of SDTs in docx backend (#853 ) Support of table of content containers in docx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-01-31 14:52:24 +01:00
Maxim Lysak	2c037ae62e	fix: Fixed docx import with headers that are also lists (#842 ) * Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> * Update docling/backend/msword_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-31 10:51:21 +01:00
Michele Dolfi	2a1f8afe7e	fix: use new add_code in html backend and add more typing hints (#850 ) fix add_code in html backend and add more typing hints Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-31 09:54:17 +01:00
Panos Vagenas	bccb022fc8	fix(markdown): fix empty block handling (#843 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-30 16:22:29 +01:00
Maxim Lysak	fea0a99a95	fix: Fix for the crash when encountering WMF images in pptx and docx (#837 ) * Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated faq Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2025-01-30 14:58:27 +01:00
Panos Vagenas	5aed9f8aeb	fix: fix single newline handling in MD backend (#824 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2025-01-28 19:05:55 +01:00
Cesar Berrospi Ramis	a112d7a035	fix: parse html with omitted body tag (#818 ) * fix: parse HTML files without body tag Parse HTML files without 'body' tag, since it is optional in HTML5 specification. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: ensure docling converts HTML without body tag Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-27 16:59:00 +01:00
Cesar Berrospi Ramis	c2ae1cc4ca	docs: description of supported formats and backends (#788 ) * chore: remove type-ignore marks for attaching text to non GroupItems After commit b74208 of docling-core, text items can be attached to any NodeItem and therefore the ignore[arg-type] type marks can be removed. Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * test: remove unnecessary imports Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add documentation on supported formats and backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * docs: add notebook example with XML backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-26 08:10:33 +01:00
Peter W. J. Staar	a458e298ca	fix: added extraction of byte-images in excel (#804 ) * fix(msexcel): ignore Mypy checking for _find_images_in_sheet function Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> * fixed some issues Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * pinned pillow in pyproject Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Jiun An Tsai <andrew@247365-Macbook.local> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Jiun An Tsai <andrew@247365-Macbook.local> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-24 18:48:02 +01:00
Panos Vagenas	88a0e66adc	feat: add Docling JSON ingestion (#783 ) * feat: add Docling JSON ingestion Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update conversion as per review comments, add tests, revert Docling JSON disambiguation, document intricacies Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * Update docling/backend/json/docling_json_backend.py Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2025-01-24 18:05:23 +01:00
Matteo	3213b247ad	feat: Code and equation model for PDF and code blocks in markdown (#752 ) * propagated changes for new CodeItem class Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Rebased branch on latest main. changes for CodeItem Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused files Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * chore: update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * pin latest docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update docling-core pinning Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use new add_code in backends and update typing in MD backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * added if statement for backend Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed unused import Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * removed print statements Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * gt for new pdf Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * Update docling/pipeline/standard_pdf_pipeline.py Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> * fixed doc comment of __call__ function of code_formula_model Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> * fix artifacts_path type Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move expansion_factor to base class Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Matteo Omenetti <omenetti.matteo@gmail.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-24 16:54:22 +01:00
Michele Dolfi	57fc28d3d8	refactor: allow the usage of backends in the enrich models and generalize the interface (#742 ) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 09:52:38 +01:00
Christoph Auer	5a060f237d	fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719 ) fix: Properly care for all bitmap elements in OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-10 10:38:49 +01:00
Christoph Auer	42856fdf79	fix: Let BeautifulSoup detect the HTML encoding (#695 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-07 15:49:28 +01:00
Jinfeng Sun	d49650c54f	fix(mspowerpoint): handle invalid images in PowerPoint slides (#650 ) - Add error handling for images that cannot be loaded by Pillow - Improve resilience when encountering corrupted or unsupported image formats - Maintain processing of other slide elements even if an image fails to load Signed-off-by: Tendo33 <sjf1998112@gmail.com>	2025-01-07 13:58:10 +01:00
Lucas Morin	fd034802b6	feat: Create a backend to transform PubMed XML files to DoclingDocument (#557 ) Signed-off-by: lucas-morin <lucas.morin222@gmail.com>	2024-12-17 19:27:09 +01:00
Cesar Berrospi Ramis	4e087504cc	feat: create a backend to parse USPTO patents into DoclingDocument (#606 ) * feat: add PATENT_USPTO as input format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * feat: add USPTO backend parser Add a backend implementation to parse patent applications and grants from the United States Patent Office (USPTO). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: change the name of the USPTO input format Change the name of the patent USPTO input format to show the typical format (XML). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: address several input formats with same mime type Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: group XML backend parsers in a subfolder Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add safe initialization of PatentUsptoDocumentBackend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2024-12-17 16:35:23 +01:00
Christoph Auer	aca57f0527	feat: docling-parse v2 as default PDF backend (#549 ) * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Upgrade to ds-glm 1.0 and docling-parse 3.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lock Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix DP2 backend code, change CLI default backend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-09 13:26:17 +01:00
Maxim Lysak	eb7ffcdd1c	fix: Correcting DefaultText ID for MS Word backend (#537 ) Correcting DefaultText ID for MS Word backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 15:48:35 +01:00
Maxim Lysak	3e073dfbeb	feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534 ) Using style id instead of style names, which should be localization agnostic Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 15:17:56 +01:00
Maxim Lysak	b730b2d7a0	fix: Missing text in docx (t tag) when embedded in a table (#528 ) Fix for missing text in docx (t tag) when embedded in a table Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 12:37:25 +01:00
Manuel030	767563bf8b	fix: use correct image index in word backend (#442 ) * fix image index in word backend Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * fix: Fixes for wordx (#432) * fixes for referencing drawing blip in wordx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated lxml dependency version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * sign dco Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * correct rebase error Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> --------- Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-27 13:45:07 +01:00
Maxim Lysak	d0a1180478	fix: Fixes for wordx (#432 ) * fixes for referencing drawing blip in wordx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated lxml dependency version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-26 14:44:43 +01:00
Peter W. J. Staar	926dfd29d5	feat: added excel backend (#334 ) * feat: added excel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first msexcel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tooling for the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working version for excel parsing of tables Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactor EXCEL to XLSX Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the unit tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran poetry lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * adding images to output [WIP] Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tests for merged cells in excel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2024-11-19 12:21:17 +01:00
Maxim Lysak	7a97d7119f	feat: Extracting picture data for raster images found in PPTX (#349 ) * Added picture data for pptx pictures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tests for pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Inferring image DPI from pptx file Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-18 15:22:28 +01:00
Maxim Lysak	8533039b0c	fix: Fixing images in the input Word files (#330 ) * Fixing images identification in the input Word files Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Populating extracted image data into docling picture for wordx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed base64 dependency in msword_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-14 13:33:34 +01:00
Maxim Lysak	fb8ba861e2	fix: Handling of single-cell tables in DOCX backend (#314 ) * Handling of single-cell tables in DOCX backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * returned try-catch on tables handling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * proceed processing the content of single cell table as if its just part of the body Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example of trickly 1 cell table docx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-12 15:20:55 +01:00
Maxim Lysak	81c8243a8b	fix: Added handling of grouped elements in pptx backend (#307 ) * Added handling of grouped elements in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * updated log.warn to warning Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 16:38:21 +01:00
Maxim Lysak	53bf2d1790	Added handling of code blocks in html with <pre> tag (#302 ) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 15:00:11 +01:00
Ikko Eltociear Ashimine	c3098e3c12	chore: fix typo (#241 ) * chore: update pypdfium2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> * chore: update docling_parse_v2_backend.py occured -> occurred Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com> --------- Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>	2024-11-05 16:20:04 +01:00
Christoph Auer	2a2c65bf4f	feat: Add pipeline timings and toggle visualization, establish debug settings (#183 ) * Add settings to turn visualization on or off Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add profiling code to all models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Refactor and fix profiling codes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Visualization codes output PNG to debug dir Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for time logging Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Optimize imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add start_timestamps to ProfilingItem Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-10-30 15:04:19 +01:00
Peter W. J. Staar	f542460af3	fix: fix duplicate title and heading + add e2e tests for html and docx (#186 ) * add real e2e tests for html and docx Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the output of itxt Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the text Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the tests (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the examples (1) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the output of the test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the tests, moved the ground-truth Signed-off-by: Peter Staar <taa@zurich.ibm.com> * moved the ground-truth data Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the html tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * restructure title fix (#187) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-30 13:14:56 +01:00
Panos Vagenas	b9f5c74a7d	fix: fix header levels for DOCX & HTML (#184 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-28 17:02:52 +01:00
Maxim Lysak	94d0729c50	fix: handling of long sequence of unescaped underscore chars in markdown (#173 ) * Fix for md hanging when encountering long sequence of unescaped underscore chars Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added comment explaining reason for fix Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixed trailing inline text handling (at the end of a file), and corrected underscore sequence shortening Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * making fix more rare Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-10-28 16:34:48 +01:00
Maxim Lysak	7d19418b77	fix: HTML backend, fixes for Lists and nested texts (#180 ) * Fixes for HTML backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed prints Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaning up Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-10-25 20:14:04 +02:00
Maxim Lysak	88c1673057	fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178 ) * Small fix to properly handle trailing inline text in the md backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added proper handling of headers with bold, italic or emphasis Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed print Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Made smarter processing of headers, with arbitrary styling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated docling-core to 2.2.1 Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests because of the change in Markdown export in docling-core Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-10-25 18:02:20 +02:00

68 Commits