Commit Graph

539 Commits

Author SHA1 Message Date
Michele Dolfi
63d80edca2
feat: output page images and extracted bbox (#31)
* Add assemble options and example saving pages and figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add options for different page elements, improve example and flip name of assemble_options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-12 18:25:45 +02:00
github-actions[bot]
0bf4a43ed5 chore: bump version to 1.2.1 [skip ci] 2024-08-07 15:38:00 +00:00
Michele Dolfi
79ef8d2f2f
fix: update (vuln) deps (#29)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:29:36 +02:00
Michele Dolfi
794b20a50a
fix: type of path_or_stream in PdfDocumentBackend (#28)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:20:44 +02:00
Michele Dolfi
9550db8e64
docs: improve examples (#27)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:16:35 +02:00
github-actions[bot]
20cbe7c24a chore: bump version to 1.2.0 [skip ci] 2024-08-07 14:35:03 +00:00
Maxim Lysak
b8f5e38a8c
feat: introducing docling_backend (#26)
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-08-07 16:22:36 +02:00
github-actions[bot]
62ba4aaf31 chore: bump version to 1.1.2 [skip ci] 2024-07-31 12:35:59 +00:00
Panos Vagenas
d2d9543415
fix: set page number using 1-based indexing (#22)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-31 14:28:44 +02:00
github-actions[bot]
e102827753 chore: bump version to 1.1.1 [skip ci] 2024-07-30 12:53:54 +00:00
Maxim Lysak
f4bf3d25b9
fix: Correct text extraction for table cells (#21)
* Fixes for the scaling transformation of table cell bounding boxes when using do_cell_matching = False;
  corrected examples/convert.py with the appropriate parameter, for a good-quality example conversion

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>

* Completed checks

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-07-30 14:51:47 +02:00
github-actions[bot]
b07c4a7a4a chore: bump version to 1.1.0 [skip ci] 2024-07-26 15:01:56 +00:00
Panos Vagenas
d603137383
feat: add simplified single-doc conversion (#20)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-26 16:55:33 +02:00
mara004
3eca8b8485
refactor(pypdfium2): just forward input to PdfDocument directly (#17)
PdfDocument() should accept strings, paths, bytes and byte streams. If not, please file a bug report.

Signed-off-by: mara004 <geisserml@gmail.com>
2024-07-25 08:54:57 +02:00
github-actions[bot]
6db2b350dd chore: bump version to 1.0.2 [skip ci] 2024-07-24 12:18:21 +00:00
Michele Dolfi
54b3dda141
fix: add easyocr to main deps for valid extra (#19)
* fix: add easyocr to main deps for valid extra

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove group

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 14:11:26 +02:00
github-actions[bot]
3e92f0bfba chore: bump version to 1.0.1 [skip ci] 2024-07-24 09:28:47 +00:00
Michele Dolfi
b0725e0aa6
fix: expose ocr as extra (#18)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 11:14:17 +02:00
github-actions[bot]
9f2add112f chore: bump version to 1.0.0 [skip ci] 2024-07-18 15:52:38 +00:00
Michele Dolfi
71c3a9c8cd
feat!: v1.0.0 release (#16)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:50:14 +02:00
Michele Dolfi
7bc20adc16
pin docling-ibm-models 1.1.0 with python 3.10 support (#15)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:27:48 +02:00
Panos Vagenas
eb0b208272
chore: switch to docling-core Markdown export (#14)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 16:10:05 +02:00
Panos Vagenas
28d1c746a6
chore: update README (#13)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 11:23:23 +02:00
github-actions[bot]
f09ffcc8f4 chore: bump version to 0.4.0 [skip ci] 2024-07-17 14:26:50 +00:00
Christoph Auer
e9526bb11e
feat: Optimize table extraction quality, add configuration options (#11)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-17 16:13:21 +02:00
github-actions[bot]
3e2ede8107 chore: bump version to 0.3.1 [skip ci] 2024-07-17 13:58:51 +00:00
Michele Dolfi
d1d1724537
fix: missing type for default values (#12)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-17 15:54:43 +02:00
Panos Vagenas
2baa35c548
docs: reflect supported Python versions, add badges (#10)
* docs: reflect supported Python versions, add badges

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor HTML fix

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-17 15:49:26 +02:00
github-actions[bot]
0dfa4548d3 chore: bump version to 0.3.0 [skip ci] 2024-07-17 12:11:15 +00:00
Michele Dolfi
fb72688ff7
feat: enable python 3.12 support by updating glm (#8)
* update deepsearch-glm for python 3.12 support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* enable python 3.12 in ci tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-17 14:03:26 +02:00
Christoph Auer
2803222ee1
docs: Add setup with pypi to Readme (#7)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-07-16 14:15:09 +02:00
github-actions[bot]
5c88574d03 chore: bump version to 0.2.0 [skip ci] 2024-07-16 11:37:14 +00:00
Michele Dolfi
b1479cf4ec
feat: build with ci (#6)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-16 13:34:42 +02:00
Michele Dolfi
b4f45ce96b
disable docs build (#5) 2024-07-16 13:14:44 +02:00
Michele Dolfi
e45dc5d1a5
ci: Add Github Actions (#4)
* add Github Actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply styling

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update .github/actions/setup-poetry/action.yml

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* add semantic-release config

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-16 13:05:04 +02:00
Christoph Auer
b9dc892385
Update convert.py (#3)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-07-15 18:02:42 +02:00
Christoph Auer
05ab89f958
doc: More documentation updates (#2)
* Update README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Bump version

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-07-15 14:59:53 +02:00
Christoph Auer
180f70c6e8
docs: Update links, add GH repository to metadata (#1)
* Add repo, absolute URLs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-07-15 12:43:05 +02:00
Christoph Auer
e2d996753b Initial commit 2024-07-15 09:42:42 +02:00