81 Commits

Author SHA1 Message Date
Michele Dolfi
6f1811e050
chore: fix placeholders in license (#63)
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-09-06 17:10:07 +02:00
github-actions[bot]
d3711437f6
chore: bump version to 1.9.0 [skip ci]
2024-09-03 13:33:40 +00:00
Michele Dolfi
1de2e4f924
feat: export document pages as multimodal output (#54)
* feat: export document pages as multimodal output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* create a single parquet output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add loading into HF datasets library

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-03 15:05:35 +02:00
Christoph Auer
69e5d951a3
docs: Update MAINTAINERS.md (#59)
* docs: Update MAINTAINERS.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update MAINTAINERS.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update MAINTAINERS.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-02 12:34:38 +02:00
Christoph Auer
85b7348846
docs: Mention quackling on README (#58)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-02 12:27:29 +02:00
github-actions[bot]
66ed096c40
chore: bump version to 1.8.5 [skip ci]
2024-08-30 12:37:54 +00:00
Peter W. J. Staar
48f4d1ba52
fix: Add unit tests (#51)
* add the pytests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* renamed the test folder and added the toplevel test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the toplevel function test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to start running all tests successfully

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the reference converted documents

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added first test for json and md output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* replaced deprecated json function with model_dump_json

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix backend tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* commented out the drawing

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ci: avoid duplicate runs

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* commented out json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added verification of input cells

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformat code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added test to verify the cells in the pages (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* run all examples in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* make sure examples return failures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* raise a failure if examples fail

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* run examples after tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add tests and update top_level_tests using only datamodels

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unnecessary code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Validate conversion status on e2e test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* package verify utils and add more tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* reduce docs in example, since they are already in the tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip batch_convert

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse 1.1.2

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* updated the error messages

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* commented out the json verification for now

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* bumped GLM version

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin new docling-parse v1.1.3

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
github-actions[bot]
256f4d504e
chore: bump version to 1.8.4 [skip ci]
2024-08-30 08:47:57 +00:00
Michele Dolfi
de85e46ced
fix: propagate row_section in tables (#57)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 10:36:00 +02:00
Michele Dolfi
a8a60d52b1
docs: add instructions for cpu-only installation (#56)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 10:20:21 +02:00
github-actions[bot]
5c46749e70
chore: bump version to 1.8.3 [skip ci]
2024-08-28 10:37:38 +00:00
Michele Dolfi
f49ee825c3
fix: table cells overlap and model warnings (#53)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-28 12:30:42 +02:00
github-actions[bot]
d0403aaebf
chore: bump version to 1.8.2 [skip ci]
2024-08-27 09:53:15 +00:00
Panos Vagenas
e46a66a176
fix: refine conversion result (#52)
- fields `output` & `assembled` need not be optional
- introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-08-27 11:50:43 +02:00
Michele Dolfi
fe817b11d7
docs: update interface in README (#50)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-26 15:36:39 +02:00
github-actions[bot]
7052bee999
chore: bump version to 1.8.1 [skip ci]
2024-08-26 11:55:37 +00:00
Michele Dolfi
8cc147bc56
fix: align output formats (#49)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-26 13:30:26 +02:00
github-actions[bot]
053eae4bdf
chore: bump version to 1.8.0 [skip ci]
2024-08-23 14:24:04 +00:00
Christoph Auer
a294b7e64a
feat: Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce page-level error checks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 16:18:41 +02:00
github-actions[bot]
3226b20779
chore: bump version to 1.7.1 [skip ci]
2024-08-23 11:56:02 +00:00
Christoph Auer
8808463cec
fix: Better raise exception when a page fails to parse (#46)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Raise from page backend if page is not correctly parsed

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 13:51:42 +02:00
Christoph Auer
7e84533299
fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45)
* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 12:51:02 +02:00
github-actions[bot]
1930f08d4e
chore: bump version to 1.7.0 [skip ci]
2024-08-22 12:00:25 +00:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44)
* Use docling-parse page-by-page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Propagate document_hash to PDF backends, use docling-parse 1.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin after more packages on pypi

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 13:49:37 +02:00
github-actions[bot]
f7c50c8b0e
chore: bump version to 1.6.3 [skip ci]
2024-08-22 11:02:35 +00:00
Michele Dolfi
fac5745dc8
fix: usage of bytesio with docling-parse (#43)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 12:59:49 +02:00
github-actions[bot]
1347c01a9e
chore: bump version to 1.6.2 [skip ci]
2024-08-22 07:32:54 +00:00
Michele Dolfi
69952682ed
fix: remove [ocr] extra to fix wheel install (#42)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 09:25:19 +02:00
github-actions[bot]
47c6dab6d2
chore: bump version to 1.6.1 [skip ci]
2024-08-21 17:41:26 +00:00
Christoph Auer
f19871a5a1
fix: Add scipy as dependency (#40)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:21:02 +02:00
Christoph Auer
4a1ceaf65c
Update docling-ibm-models to v1.1.2 (#39)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:12:38 +02:00
github-actions[bot]
22a5c29c63
chore: bump version to 1.6.0 [skip ci]
2024-08-20 13:34:53 +00:00
Christoph Auer
e94d317c02
feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38)
* Introduce adaptive OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Factor out BaseOcrModel, add docling-parse backend tests, fixes

* Make easyocr default dep

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 15:28:03 +02:00
github-actions[bot]
47b8ad917e
chore: bump version to 1.5.0 [skip ci]
2024-08-20 11:53:52 +00:00
Michele Dolfi
78347bf679
feat: allow computing page images on-demand with scale and cache them (#36)
* feat: allow computing page images on-demand and cache them

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: expose scale for export of page images and document elements

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix comment

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-20 13:27:19 +02:00
Christoph Auer
c253dd743a
Add redbooks to test data, small additions (#35)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 12:36:00 +02:00
Michele Dolfi
a13114bafd
docs: add technical paper ref (#37)
* docs: add technical paper ref

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* use techreport bibtex type

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-08-20 12:32:53 +02:00
github-actions[bot]
778e51ef18
chore: bump version to 1.4.0 [skip ci]
2024-08-14 11:46:55 +00:00
Michele Dolfi
349b0e914f
fix: allow newer torch versions (#34)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 13:37:36 +02:00
Michele Dolfi
90dd676422
feat: update parser with bytesio interface and set as new default backend (#32)
* update parser with bytesio interface

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* change default backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update DEFAULT_BACKEND

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 12:30:00 +02:00
Christoph Auer
61be78a875
Fix class re-mapping for table of contents (#33)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-14 11:32:30 +02:00
github-actions[bot]
dd0df9f094
chore: bump version to 1.3.0 [skip ci]
2024-08-12 16:29:05 +00:00
Michele Dolfi
63d80edca2
feat: output page images and extracted bbox (#31)
* Add assemble options and example saving pages and figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add options for different page elements, improve example and flip name of assemble_options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-12 18:25:45 +02:00
github-actions[bot]
0bf4a43ed5
chore: bump version to 1.2.1 [skip ci]
2024-08-07 15:38:00 +00:00
Michele Dolfi
79ef8d2f2f
fix: update (vuln) deps (#29)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:29:36 +02:00
Michele Dolfi
794b20a50a
fix: type of path_or_stream in PdfDocumentBackend (#28)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:20:44 +02:00
Michele Dolfi
9550db8e64
docs: improve examples (#27)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:16:35 +02:00
github-actions[bot]
20cbe7c24a
chore: bump version to 1.2.0 [skip ci]
2024-08-07 14:35:03 +00:00
Maxim Lysak
b8f5e38a8c
feat: introducing docling_backend (#26)
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-08-07 16:22:36 +02:00
github-actions[bot]
62ba4aaf31
chore: bump version to 1.1.2 [skip ci]
2024-07-31 12:35:59 +00:00