Docling

Author	SHA1	Message	Date
Gaspard Petit	d3f84b2457	fix: PermissionError when using tesseract_ocr_cli_model (#496 ) Signed-off-by: Gaspard Petit <gaspardpetit@gmail.com>	2024-12-03 10:22:03 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290 ) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Christoph Auer	2a2c65bf4f	feat: Add pipeline timings and toggle visualization, establish debug settings (#183 ) * Add settings to turn visualization on or off Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add profiling code to all models Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Refactor and fix profiling codes Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Visualization codes output PNG to debug dir Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for time logging Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Optimize imports Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add start_timestamps to ProfilingItem Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-10-30 15:04:19 +01:00
Christoph Auer	a00c937e19	Ensure all models work only on valid pages (#158 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-10-18 08:54:06 +02:00
Christoph Auer	7d3be0edeb	feat!: Docling v2 (#117 ) --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maxim Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Maxim Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-16 21:02:03 +02:00
Nikos Livathinos	dae2a3b667	fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138 ) * feat(OCR tests): Introduce fuzziness in the text validation of OCR tests Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> * fix(TesseractOcrCliModel): Send the stderr to devnull to avoid poluting the console with messages from tesseract cmd Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> --------- Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-10-11 10:21:19 +02:00
Michele Dolfi	f96ea86a00	feat: add options for choosing OCR engines (#118 ) --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Peter Staar <taa@zurich.ibm.com>	2024-10-08 19:07:08 +02:00

7 Commits