Docling

Author	SHA1	Message	Date
Michele Dolfi	ed785ea122	feat: expose ocr-lang in CLI (#375) * feat: expose ocr-lang in CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use regex for supporting multiple sep Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-19 15:58:49 +01:00
Michele Dolfi	ca8524ecae	docs: add automatic generation of CLI reference (#325) * docs: add automatic generation of CLI reference Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * install deps for building CLI ref Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-15 13:18:17 +01:00
Michele Dolfi	8b437adcde	fix: reduce logging by keeping option for more verbose (#323) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-13 10:08:24 +01:00
Nikos Livathinos	c6b3763ecb	feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) - When the OCR is forced, any existing PDF cells are rejected. - Introduce the force-ocr cmd parameter in docling CLI. - Update unit tests. - Add the full_page_ocr.py example in mkdocs. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>	2024-11-12 09:46:14 +01:00
Michele Dolfi	40ad987303	feat: pdf backend, table mode as options and artifacts path (#203) * feat: add more options in the CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update CLI docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * expose artifacts-path as argument Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-11-04 14:26:05 +01:00
Johnny Salazar	af323c04ef	fit: Specify encoding when writing output file (#214) Specify encoding when writing output file to avoid errors when default target encoding doesn't have all characters. utf8 seems like the most universal and supported encoding. Otherwise, the cli fails with encoding errors when input file contains unicode text (basically most files nowadays) and the target system has default encoding set to some one-byte charset like cp1252 Signed-off-by: Johnny Salazar <cepera.ang@gmail.com>	2024-11-04 14:24:13 +01:00
Christoph Auer	7d3be0edeb	feat!: Docling v2 (#117) --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maxim Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Maxim Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-10-16 21:02:03 +02:00
Michele Dolfi	f96ea86a00	feat: add options for choosing OCR engines (#118) --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com> Co-authored-by: Peter Staar <taa@zurich.ibm.com>	2024-10-08 19:07:08 +02:00
Christoph Auer	d6df76f90b	feat: Support tableformer model choice (#90 ) * Support tableformer model choice Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update datamodel structure Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update docs Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Add test unit for table options Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Ensure import backwards-compatibility for PipelineOptions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update README Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Adjust parameters on custom_convert Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Update Dockerfile Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-09-26 21:37:08 +02:00
Panos Vagenas	d96b96c848	fix: fix OCR setting for pypdfium, minor refactor (#102) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-09-24 14:36:00 +02:00
Panos Vagenas	3c46e4266c	feat: add URL support to CLI (#99) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-09-24 08:47:53 +02:00
Michele Dolfi	2870fdc857	fix: CLI compatibility with python 3.10 and 3.11 (#79) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-09-16 12:32:45 +02:00
Peter W. J. Staar	98990784df	feat: add docling cli (#75 ) * chore: add simple convert script Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted all Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted all Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added default arg Signed-off-by: Peter Staar <taa@zurich.ibm.com> * use typer for the docling CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * describe output when saving Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add tests for CLI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add export options Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2024-09-13 14:03:09 +02:00

13 Commits
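
Commit af323c04ef describes why the CLI writes its output with an explicit encoding: a one-byte default charset such as cp1252 cannot represent every Unicode character. The following is a minimal sketch of that idea, not the actual Docling code; the helper names `write_text_output` and `can_encode` and the sample string are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def write_text_output(path: Path, text: str) -> None:
    """Write text with an explicit UTF-8 encoding (hypothetical helper).

    Without the `encoding` argument, Python falls back to the platform's
    preferred encoding, which may be a one-byte charset such as cp1252.
    """
    with path.open("w", encoding="utf-8") as fp:
        fp.write(text)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Illustrative sample containing characters outside cp1252.
sample = "Zürich → 東京"

# Round-trip the sample through a temporary file using UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "output.md"
    write_text_output(out_path, sample)
    round_tripped = out_path.read_text(encoding="utf-8")
```

Reading the file back with the same explicit encoding keeps the round trip independent of the system locale, which is the failure mode the commit message describes.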