Docling

Author	SHA1	Message	Date
Michele Dolfi	57fc28d3d8	refactor: allow the usage of backends in the enrich models and generalize the interface (#742 ) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2025-01-15 09:52:38 +01:00
Christoph Auer	5a060f237d	fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719 ) fix: Properly care for all bitmap elements in OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-10 10:38:49 +01:00
Christoph Auer	5cb4cf6f19	fix: Correct scaling of debug visualizations, tune OCR (#700 ) * fix: Correct scaling of debug visualizations, tune OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: remove unused imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * chore: Update docling-core Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-08 12:26:44 +01:00
Christoph Auer	42856fdf79	fix: Let BeautifulSoup detect the HTML encoding (#695 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2025-01-07 15:49:28 +01:00
Jinfeng Sun	d49650c54f	fix(mspowerpoint): handle invalid images in PowerPoint slides (#650 ) - Add error handling for images that cannot be loaded by Pillow - Improve resilience when encountering corrupted or unsupported image formats - Maintain processing of other slide elements even if an image fails to load Signed-off-by: Tendo33 <sjf1998112@gmail.com>	2025-01-07 13:58:10 +01:00
Luke Harrison	0ee849e8bc	feat: added http header support for document converter and cli (#642 ) * added http header support for document converter and cli Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * fixed formatting and typing issues Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> * use pydantic to parse dict suggested by @dolfim-ibm Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> --------- Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com> Signed-off-by: Luke Harrison <luke.harrison1@ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2025-01-07 10:15:14 +01:00
Lucas Morin	fd034802b6	feat: Create a backend to transform PubMed XML files to DoclingDocument (#557 ) Signed-off-by: lucas-morin <lucas.morin222@gmail.com>	2024-12-17 19:27:09 +01:00
Christoph Auer	60dc852f16	feat: Updated Layout processing with forms and key-value areas (#530 ) * Upgraded Layout Postprocessing, sending old code back to ERZ Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement hierachical cluster layout processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested cluster processing through full pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested clusters through GLM as payload Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI. - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run. - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting. - Refactor the way how the docling-ibm-models are called to match the new init signature of models. - Translate the accelerator options to the specific inputs for third-party models. - Extend the docling CLI with parameters to set the num_threads and device. - Add new unit tests. - Write new example how to use the accelerator options. * fix: Improve the pydantic objects in the pipeline_options and imports. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Updated test ground-truth Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test ground-truth (again), bugfix for empty layout Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Do proper check to set the device in EasyOCR, RapidOCR. 
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Correct the way to set GPU for EasyOCR, RapidOCR Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Ocr AccleratorDevice Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Merge pull request #556 from DS4SD/cau/layout-processing-improvement feat: layout processing improvements and bugfixes * Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update HF model ref, reset test generate Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Repin to release package versions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Many layout processing improvements, add document index type Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update pinnings to docling-core Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix table box snapping Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for cluster pre-ordering Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Introduce OCR confidence, propagate to orphan in post-processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix form and key value area groups Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Adjust confidence in EasyOcr Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Roll back CLI changes from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test GT Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update docling-core pinning Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Annoying fixes for historical python versions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test GT for legacy Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Comment cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> 
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-12-17 17:32:24 +01:00
Cesar Berrospi Ramis	4e087504cc	feat: create a backend to parse USPTO patents into DoclingDocument (#606 ) * feat: add PATENT_USPTO as input format Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> * feat: add USPTO backend parser Add a backend implementation to parse patent applications and grants from the United States Patent Office (USPTO). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: change the name of the USPTO input format Change the name of the patent USPTO input format to show the typical format (XML). Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: address several input formats with same mime type Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * refactor: group XML backend parsers in a subfolder Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> * chore: add safe initialization of PatentUsptoDocumentBackend Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --------- Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com> Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>	2024-12-17 16:35:23 +01:00
itsainii	3b53bd38c8	feat: Add Easyocr parameter recog_network (#613 ) * Update easyocr_model.py Added this line of code to get recog_network of easyocr parameter recog_network = self.options.recog_network Signed-off-by: itsainii <aininawawii@gmail.com> * Update pipeline_options.py Added this line in EasyOcrOptions function recog_network: Optional[str] = 'standard' Signed-off-by: itsainii <aininawawii@gmail.com> * Add Easyocr recog_network parameter Signed-off-by: itsainii <aininawawii@gmail.com> --------- Signed-off-by: itsainii <aininawawii@gmail.com>	2024-12-17 09:47:18 +01:00
Nikos Livathinos	19fad9261c	feat: Introduce support for GPU Accelerators (#593 ) * Upgraded Layout Postprocessing, sending old code back to ERZ Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Implement hierachical cluster layout processing Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested cluster processing through full pipeline Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pass nested clusters through GLM as payload Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Clean up imports again Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI. - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run. - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting. - Refactor the way how the docling-ibm-models are called to match the new init signature of models. - Translate the accelerator options to the specific inputs for third-party models. - Extend the docling CLI with parameters to set the num_threads and device. - Add new unit tests. - Write new example how to use the accelerator options. * fix: Improve the pydantic objects in the pipeline_options and imports. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Updated test ground-truth Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Updated test ground-truth (again), bugfix for empty layout Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Do proper check to set the device in EasyOCR, RapidOCR. 
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * Rollback changes from main Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update test gt Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unused debug settings Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Review fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Nail the accelerator defaults for MPS Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-12-13 17:45:22 +01:00
Abhishek Kumar	3da166eafa	feat: Add timeout limit to document parsing job. DS4SD#270 (#552 ) Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com> Testing: (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec. WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert. INFO:docling.cli.main:Processed 1 docs, of which 1 failed INFO:docling.cli.main:All documents were converted in 36.29 seconds. (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 58.56 seconds. (.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose INFO:docling.document_converter:Going to convert document batch... INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec. INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md INFO:docling.cli.main:Processed 1 docs, of which 0 failed INFO:docling.cli.main:All documents were converted in 59.88 seconds. 
(.venv) mario@Abhisheks-MacBook-Air docling % docling Usage: docling [OPTIONS] source ╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ --from [docx\|pptx\|html\|image\|pdf\|asciido Specify input formats to convert │ │ c\|md\|xlsx] from. Defaults to all formats. │ │ [default: None] │ │ --to [md\|json\|html\|text\|doctags] Specify output formats. Defaults to │ │ Markdown. │ │ [default: None] │ │ --image-export-mode [placeholder\|embedded\|referenced] Image export mode for the document │ │ (only in case of JSON, Markdown or │ │ HTML). With `placeholder`, only the │ │ position of the image is marked in │ │ the output. In `embedded` mode, the │ │ image is embedded as base64 encoded │ │ string. In `referenced` mode, the │ │ image is exported in PNG format and │ │ referenced from the main exported │ │ document. │ │ [default: embedded] │ │ --ocr --no-ocr If enabled, the bitmap content will │ │ be processed using OCR. │ │ [default: ocr] │ │ --force-ocr --no-force-ocr Replace any existing text with OCR │ │ generated text over the full │ │ content. │ │ [default: no-force-ocr] │ │ --ocr-engine [easyocr\|tesseract_cli\|tesseract\| The OCR engine to use. │ │ ocrmac\|rapidocr] [default: easyocr] │ │ --ocr-lang TEXT Provide a comma-separated list of │ │ languages used by the OCR engine. │ │ Note that each OCR engine has │ │ different values for the language │ │ names. │ │ [default: None] │ │ --pdf-backend [pypdfium2\|dlparse_v1\|dlparse_v2] The PDF backend to use. 
│ │ [default: dlparse_v2] │ │ --table-mode [fast\|accurate] The mode to use in the table │ │ structure model. │ │ [default: fast] │ │ --artifacts-path PATH If provided, the location of the │ │ model artifacts. │ │ [default: None] │ │ --abort-on-error --no-abort-on-error If enabled, the bitmap content will │ │ be processed using OCR. │ │ [default: no-abort-on-error] │ │ --output PATH Output directory where results are │ │ saved. │ │ [default: .] │ │ --verbose -v INTEGER Set the verbosity level. -v for │ │ info logging, -vv for debug │ │ logging. │ │ [default: 0] │ │ --debug-visualize-cells --no-debug-visualize-cells Enable debug output which │ │ visualizes the PDF cells │ │ [default: no-debug-visualize-cells] │ │ --debug-visualize-ocr --no-debug-visualize-ocr Enable debug output which │ │ visualizes the OCR cells │ │ [default: no-debug-visualize-ocr] │ │ --debug-visualize-layout --no-debug-visualize-layout Enable debug output which │ │ visualizes the layour clusters │ │ [default: │ │ no-debug-visualize-layout] │ │ --debug-visualize-tables --no-debug-visualize-tables Enable debug output which │ │ visualizes the table cells │ │ [default: │ │ no-debug-visualize-tables] │ │ --version Show version information. │ │ --document-timeout FLOAT The timeout for processing each │ │ document, in seconds. │ │ [default: None] │ │ --help Show this message and exit. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯	2024-12-11 15:06:10 +01:00
Christoph Auer	aee9c0b324	fix: Do not import python modules from deepsearch-glm (#569 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-11 12:29:06 +01:00
Christoph Auer	f45499ce93	fix: Handle no result from RapidOcr reader (#558 ) Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-12-10 16:25:05 +01:00
Michele Dolfi	a7df337654	fix: make enum serializable with human-readable value (#555 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-12-10 13:12:44 +01:00
Christoph Auer	7972d47f88	fix: Call into docling-core for legacy document transform (#551 ) Call into docling-core for legacy document transform Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-09 17:06:47 +01:00
Nikos Livathinos	78f61a8522	fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544 ) * fix: main: Introduce format options for Image with the same pdf pipeline_options. Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Silence the tqdm messages during the downloading of model files Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Code styling Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix: Use the HF API to disable the tqdm progress bars Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-12-09 15:57:37 +01:00
Christoph Auer	aca57f0527	feat: docling-parse v2 as default PDF backend (#549 ) * Move to_docling_document from ds-glm to this repo Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Upgrade to ds-glm 1.0 and docling-parse 3.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lock Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix DP2 backend code, change CLI default backend Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-09 13:26:17 +01:00
Panos Vagenas	c8ecdd987e	feat: expose new hybrid chunker, update docs (#384 ) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-09 08:28:29 +01:00
Maxim Lysak	eb7ffcdd1c	fix: Correcting DefaultText ID for MS Word backend (#537 ) Correcting DefaultText ID for MS Word backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 15:48:35 +01:00
Maxim Lysak	3e073dfbeb	feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534 ) Using style id instead of style names, which should be localization agnostic Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 15:17:56 +01:00
Sander Maijers	9102fe1adc	fix: Add `py.typed` marker file (#531 ) feat: add `py.typed` marker file See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>	2024-12-06 13:42:14 +01:00
Peter W. J. Staar	0d11e30dd8	fix: Enable HTML export in CLI and add options for image mode (#513 ) * updated README Signed-off-by: Peter Staar <taa@zurich.ibm.com> * removed duck in title Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the index.md Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the cli to export html Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added html to cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * removed the duck emoji, added the in the cli. Currently, the referenced seems broken Signed-off-by: Peter Staar <taa@zurich.ibm.com> * cleaning up the comments Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reference is now working Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Clean up styling and docs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin docling-core>=2.7.1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2024-12-06 12:37:57 +01:00
Maxim Lysak	b730b2d7a0	fix: Missing text in docx (t tag) when embedded in a table (#528 ) Fix for missing text in docx (t tag) when embedded in a table Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-12-06 12:37:25 +01:00
Michele Dolfi	8ada0bccc7	fix: folder input in cli (#511 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-12-04 14:22:00 +01:00
Christoph Auer	34c7c79858	fix: improve handling of disallowed formats (#429 ) * fix: Fixes and tests for StopIteration on .convert() Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Remove unnecessary case handling Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * fix: Other test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * improve handling of unsupported types - Introduced new explicit exception types instead of `RuntimeError` - Introduced new `ConversionStatus` value for unsupported formats - Tidied up converter member typing & removed asserts Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * robustify & simplify format option resolution Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * rename new status, populate ConversionResult errors Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-12-03 12:45:32 +01:00
guglie	c90c41c391	fix: ParserError EOF inside string (#470 ) (#472 ) Signed-off-by: guglie <gdguglie@gmail.com>	2024-12-03 11:21:18 +01:00
Panos Vagenas	051789d017	perf: prevent temp file leftovers, reuse core type (#487 ) * chore: reuse DocumentStream from docling-core Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * update docling-core version Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * [skip ci] document import line Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490) use new resolve_source_to_x functions Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2024-12-03 10:40:28 +01:00
Gaspard Petit	d3f84b2457	fix: PermissionError when using tesseract_ocr_cli_model (#496 ) Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>	2024-12-03 10:22:03 +01:00
Michele Dolfi	d4872103b8	docs: add automatic api reference (#475 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-12-02 09:55:52 +01:00
Michele Dolfi	dd8de46267	fix(cli): expose debug options (#467 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-29 13:25:58 +01:00
Swaymaw	85b29990be	feat(ocr): added support for RapidOCR engine (#415 ) * adding rapidocr engine for ocr in docling Signed-off-by: swayam-singhal <swayam.singhal@inito.com> * fixing styling format Signed-off-by: Swaymaw <swaymaw@gmail.com> * updating pyproject.toml and poetry.lock to fix ci bugs Signed-off-by: Swaymaw <swaymaw@gmail.com> * help poetry pinning for python3.9 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * simplifying rapidocr options so that device can be changed using a single option for all models Signed-off-by: Swaymaw <swaymaw@gmail.com> * fix styling issues and small bug in rapidOcrOptions Signed-off-by: Swaymaw <swaymaw@gmail.com> * use default device until we enable global management Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: swayam-singhal <swayam.singhal@inito.com> Signed-off-by: Swaymaw <swaymaw@gmail.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: swayam-singhal <swayam.singhal@inito.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-27 13:57:41 +01:00
Manuel030	767563bf8b	fix: use correct image index in word backend (#442 ) * fix image index in word backend Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * fix: Fixes for wordx (#432) * fixes for referencing drawing blip in wordx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated lxml dependency version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * sign dco Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> * correct rebase error Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> --------- Signed-off-by: Manuel030 <manuelenrique.plank@gmail.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-27 13:45:07 +01:00
Maxim Lysak	d0a1180478	fix: Fixes for wordx (#432 ) * fixes for referencing drawing blip in wordx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated lxml dependency version Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-26 14:44:43 +01:00
Michele Dolfi	7b013abcf3	fix: python3.9 support (#396 ) * fixes for python3.9 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse with python3.9 wheels Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update deps Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-20 15:21:40 +01:00
nuridol	6efa96c983	feat: add support for `ocrmac` OCR engine on macOS (#276 ) * feat: add support for `ocrmac` OCR engine on macOS - Integrates `ocrmac` as an OCR engine option for macOS users. - Adds configuration options and dependencies for `ocrmac`. - Updates documentation to reflect new engine support. This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * updated the poetry lock Signed-off-by: Suhwan Seo <nuridol@gmail.com> * Fix linting issues, update CLI docs, and add error for ocrmac use on non-Mac systems - Resolved formatting and linting issues - Updated `--ocr-engine` CLI option documentation for `ocrmac` - Added RuntimeError for attempts to use `ocrmac` on non-Mac platforms Signed-off-by: Suhwan Seo <nuridol@gmail.com> * feat: add support for `ocrmac` OCR engine on macOS - Integrates `ocrmac` as an OCR engine option for macOS users. - Adds configuration options and dependencies for `ocrmac`. - Updates documentation to reflect new engine support. This change allows macOS users to utilize `ocrmac` for improved OCR performance and compatibility. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * docs: update examples and installation for ocrmac support - Added `OcrMacOptions` to `custom_convert.py` and `full_page_ocr.py` examples. - Included usage comments and examples for `OcrMacOptions` in OCR pipelines. - Updated installation guide to include instructions for installing `ocrmac`, noting macOS version requirements (10.15+). - Highlighted that `ocrmac` leverages Apple's Vision framework as an OCR backend. This enhances documentation for users working on macOS to leverage `ocrmac` effectively. Signed-off-by: Suhwan Seo <nuridol@gmail.com> * fix: update `ocrmac` dependency with macOS-specific marker - Added `sys_platform == 'darwin'` marker to the `ocrmac` dependency in `pyproject.toml` to specify macOS compatibility. 
- Updated the content hash in `poetry.lock` to reflect the changes. This ensures the `ocrmac` dependency is only installed on macOS systems. Signed-off-by: Suhwan Seo <nuridol@gmail.com> --------- Signed-off-by: Suhwan Seo <nuridol@gmail.com> Co-authored-by: Suhwan Seo <nuridol@gmail.com>	2024-11-20 12:51:19 +01:00
Michele Dolfi	32ebf55e33	fix: propagate document limits to converter (#388 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-20 08:36:51 +01:00
Shubham Gupta	3f91e7d3f1	feat: added support for exporting DocItem to an image when page image is available (#379 ) * Updated minimum docling-core version to 2.4.0 Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> * Deprecated the generate_table_images option Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> * Updated examples to use get_image instead of element.image Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com> --------- Signed-off-by: Shubham Gupta <26436285+sh-gupta@users.noreply.github.com>	2024-11-19 16:28:52 +01:00
Michele Dolfi	ed785ea122	feat: expose ocr-lang in CLI (#375 ) * feat: expose ocr-lang in CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use regex for supporting multiple sep Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-19 15:58:49 +01:00
Peter W. J. Staar	926dfd29d5	feat: added excel backend (#334 ) * feat: added excel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first msexcel backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tooling for the cli Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working version for excel parsing of tables Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added proper typing for mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * refactor EXCEL to XLSX Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the unit tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran poetry lock Signed-off-by: Peter Staar <taa@zurich.ibm.com> * adding images to output [WIP] Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the msexcel (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added tests for merged cells in excel Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com>	2024-11-19 12:21:17 +01:00
Maxim Lysak	7a97d7119f	feat: Extracting picture data for raster images found in PPTX (#349 ) * Added picture data for pptx pictures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added tests for pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Inferring image DPI from pptx file Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-18 15:22:28 +01:00
Michele Dolfi	ca8524ecae	docs: add automatic generation of CLI reference (#325 ) * docs: add automatic generation of CLI reference Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * install deps for building CLI ref Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:17 +01:00
Maxim Lysak	8533039b0c	fix: Fixing images in the input Word files (#330 ) * Fixing images identification in the input Word files Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Populating extracted image data into docling picture for wordx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Updated tests Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * removed base64 dependency in msword_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-14 13:33:34 +01:00
Michele Dolfi	8b437adcde	fix: reduce logging by keeping option for more verbose (#323 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 10:08:24 +01:00
Michele Dolfi	c9341bf22e	fix: skip glm model downloads (#322 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 08:45:28 +01:00
Maxim Lysak	fb8ba861e2	fix: Handling of single-cell tables in DOCX backend (#314 ) * Handling of single-cell tables in DOCX backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * returned try-catch on tables handling Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * proceed processing the content of single cell table as if its just part of the body Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added example of trickly 1 cell table docx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-12 15:20:55 +01:00
Christoph Auer	5d4a10b121	fix: Configure env prefix for docling settings (#315 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-11-12 10:57:16 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290 ) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Maxim Lysak	81c8243a8b	fix: Added handling of grouped elements in pptx backend (#307 ) * Added handling of grouped elements in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * updated log.warn to warning Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 16:38:21 +01:00
Maxim Lysak	53bf2d1790	Added handling of code blocks in html with <pre> tag (#302 ) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>	2024-11-11 15:00:11 +01:00


264 Commits