Docling

Author	SHA1	Message	Date
Michele Dolfi	6f1811e050	chore: fix placeholders in license (#63 ) Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2024-09-06 17:10:07 +02:00
github-actions[bot]	d3711437f6	chore: bump version to 1.9.0 [skip ci]	2024-09-03 13:33:40 +00:00
Michele Dolfi	1de2e4f924	feat: export document pages as multimodal output (#54 ) * feat: export document pages as multimodal output Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * create a single parquet output Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add loading into HF datasets library Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * cleanup Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-09-03 15:05:35 +02:00
Christoph Auer	69e5d951a3	docs: Update MAINTAINERS.md (#59 ) * docs: Update MAINTAINERS.md Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Update MAINTAINERS.md Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> * Update MAINTAINERS.md Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> --------- Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-09-02 12:34:38 +02:00
Christoph Auer	85b7348846	docs: Mention quackling on README (#58 ) Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>	2024-09-02 12:27:29 +02:00
github-actions[bot]	66ed096c40	chore: bump version to 1.8.5 [skip ci]	2024-08-30 12:37:54 +00:00
Peter W. J. Staar	48f4d1ba52	fix: Add unit tests (#51 ) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-30 14:08:20 +02:00
github-actions[bot]	256f4d504e	chore: bump version to 1.8.4 [skip ci]	2024-08-30 08:47:57 +00:00
Michele Dolfi	de85e46ced	fix: propagate row_section in tables (#57 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-30 10:36:00 +02:00
Michele Dolfi	a8a60d52b1	docs: add instructions for cpu-only installation (#56 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-30 10:20:21 +02:00
github-actions[bot]	5c46749e70	chore: bump version to 1.8.3 [skip ci]	2024-08-28 10:37:38 +00:00
Michele Dolfi	f49ee825c3	fix: table cells overlap and model warnings (#53 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-28 12:30:42 +02:00
github-actions[bot]	d0403aaebf	chore: bump version to 1.8.2 [skip ci]	2024-08-27 09:53:15 +00:00
Panos Vagenas	e46a66a176	fix: refine conversion result (#52 ) - fields `output` & `assembled` need not be optional - introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>	2024-08-27 11:50:43 +02:00
Michele Dolfi	fe817b11d7	docs: update interface in README (#50 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-26 15:36:39 +02:00
github-actions[bot]	7052bee999	chore: bump version to 1.8.1 [skip ci]	2024-08-26 11:55:37 +00:00
Michele Dolfi	8cc147bc56	fix: align output formats (#49 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-26 13:30:26 +02:00
github-actions[bot]	053eae4bdf	chore: bump version to 1.8.0 [skip ci]	2024-08-23 14:24:04 +00:00
Christoph Auer	a294b7e64a	feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47 ) * Put safety-checks for failed parse of pages Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Introduce page-level error checks Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Bump to docling-parse 1.1.1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Introduce page-level error checks Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-23 16:18:41 +02:00
github-actions[bot]	3226b20779	chore: bump version to 1.7.1 [skip ci]	2024-08-23 11:56:02 +00:00
Christoph Auer	8808463cec	fix: Better raise exception when a page fails to parse (#46 ) * Put safety-checks for failed parse of pages Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Bump to docling-parse 1.1.1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Raise from page backend if page is not correctly parsed Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-23 13:51:42 +02:00
Christoph Auer	7e84533299	fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45 ) * Put safety-checks for failed parse of pages Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Bump to docling-parse 1.1.1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-23 12:51:02 +02:00
github-actions[bot]	1930f08d4e	chore: bump version to 1.7.0 [skip ci]	2024-08-22 12:00:25 +00:00
Christoph Auer	a8c6b29a67	feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44 ) * Use docling-parse page-by-page Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Propagate document_hash to PDF backends, use docling-parse 1.0.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Upgrade lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * repin after more packages on pypi Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-22 13:49:37 +02:00
github-actions[bot]	f7c50c8b0e	chore: bump version to 1.6.3 [skip ci]	2024-08-22 11:02:35 +00:00
Michele Dolfi	fac5745dc8	fix: usage of bytesio with docling-parse (#43 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-22 12:59:49 +02:00
github-actions[bot]	1347c01a9e	chore: bump version to 1.6.2 [skip ci]	2024-08-22 07:32:54 +00:00
Michele Dolfi	69952682ed	fix: remove [ocr] extra to fix wheel install (#42 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-22 09:25:19 +02:00
github-actions[bot]	47c6dab6d2	chore: bump version to 1.6.1 [skip ci]	2024-08-21 17:41:26 +00:00
Christoph Auer	f19871a5a1	fix: Add scipy as dependency (#40 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-21 17:21:02 +02:00
Christoph Auer	4a1ceaf65c	Update docling-ibm-models to v1.1.2 (#39 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-21 17:12:38 +02:00
github-actions[bot]	22a5c29c63	chore: bump version to 1.6.0 [skip ci]	2024-08-20 13:34:53 +00:00
Christoph Auer	e94d317c02	feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38 ) * Introduce adaptive OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Factor out BaseOcrModel, add docling-parse backend tests, fixes * Make easyocr default dep Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-20 15:28:03 +02:00
github-actions[bot]	47b8ad917e	chore: bump version to 1.5.0 [skip ci]	2024-08-20 11:53:52 +00:00
Michele Dolfi	78347bf679	feat: allow computing page images on-demand with scale and cache them (#36 ) * feat: allow computing page images on-demand and cache them Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * feat: expose scale for export of page images and document elements Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix comment Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-20 13:27:19 +02:00
Christoph Auer	c253dd743a	Add redbooks to test data, small additions (#35 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-20 12:36:00 +02:00
Michele Dolfi	a13114bafd	docs: add technical paper ref (#37 ) * docs: add technical paper ref Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * use techreport bibtex type Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>	2024-08-20 12:32:53 +02:00
github-actions[bot]	778e51ef18	chore: bump version to 1.4.0 [skip ci]	2024-08-14 11:46:55 +00:00
Michele Dolfi	349b0e914f	fix: allow newer torch versions (#34 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-14 13:37:36 +02:00
Michele Dolfi	90dd676422	feat: update parser with bytesio interface and set as new default backend (#32 ) * update parser with bytesio interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * change default backend Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * update DEFAULT_BACKEND Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-14 12:30:00 +02:00
Christoph Auer	61be78a875	Fix class re-mapping for table of contents (#33 ) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com>	2024-08-14 11:32:30 +02:00
github-actions[bot]	dd0df9f094	chore: bump version to 1.3.0 [skip ci]	2024-08-12 16:29:05 +00:00
Michele Dolfi	63d80edca2	feat: output page images and extracted bbox (#31 ) * Add assemble options and example saving pages and figures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add options for different page elements, improve example and flip name of assemble_options Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-12 18:25:45 +02:00
github-actions[bot]	0bf4a43ed5	chore: bump version to 1.2.1 [skip ci]	2024-08-07 15:38:00 +00:00
Michele Dolfi	79ef8d2f2f	fix: update (vuln) deps (#29 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-07 17:29:36 +02:00
Michele Dolfi	794b20a50a	fix: type of path_or_stream in PdfDocumentBackend (#28 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-07 17:20:44 +02:00
Michele Dolfi	9550db8e64	docs: improve examples (#27 ) Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>	2024-08-07 17:16:35 +02:00
github-actions[bot]	20cbe7c24a	chore: bump version to 1.2.0 [skip ci]	2024-08-07 14:35:03 +00:00
Maxim Lysak	b8f5e38a8c	feat: introducing docling_backend (#26 ) Uses our own docling_parse to reliably get PDF cells To get page images, this backend uses pypdfium2 Signed-off-by: Maxim Lysak <mly@zurich.ibm.com> Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>	2024-08-07 16:22:36 +02:00
github-actions[bot]	62ba4aaf31	chore: bump version to 1.1.2 [skip ci]	2024-07-31 12:35:59 +00:00

81 Commits