Christoph Auer
7e84533299
fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages ( #45 )
...
* Put safety-checks for failed parse of pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-23 12:51:02 +02:00
github-actions[bot]
1930f08d4e
chore: bump version to 1.7.0 [skip ci]
2024-08-22 12:00:25 +00:00
Christoph Auer
a8c6b29a67
feat: Upgrade docling-parse PDF backend and interface to use page-by-page parsing ( #44 )
...
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* repin after more packages on pypi
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 13:49:37 +02:00
github-actions[bot]
f7c50c8b0e
chore: bump version to 1.6.3 [skip ci]
2024-08-22 11:02:35 +00:00
Michele Dolfi
fac5745dc8
fix: usage of bytesio with docling-parse ( #43 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 12:59:49 +02:00
github-actions[bot]
1347c01a9e
chore: bump version to 1.6.2 [skip ci]
2024-08-22 07:32:54 +00:00
Michele Dolfi
69952682ed
fix: remove [ocr] extra to fix wheel install ( #42 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-22 09:25:19 +02:00
github-actions[bot]
47c6dab6d2
chore: bump version to 1.6.1 [skip ci]
2024-08-21 17:41:26 +00:00
Christoph Auer
f19871a5a1
fix: Add scipy as dependency ( #40 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:21:02 +02:00
Christoph Auer
4a1ceaf65c
Update docling-ibm-models to v1.1.2 ( #39 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-21 17:12:38 +02:00
github-actions[bot]
22a5c29c63
chore: bump version to 1.6.0 [skip ci]
2024-08-20 13:34:53 +00:00
Christoph Auer
e94d317c02
feat: Add adaptive OCR, factor out treatment of OCR areas and cell filtering ( #38 )
...
* Introduce adaptive OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Factor out BaseOcrModel, add docling-parse backend tests, fixes
* Make easyocr default dep
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 15:28:03 +02:00
github-actions[bot]
47b8ad917e
chore: bump version to 1.5.0 [skip ci]
2024-08-20 11:53:52 +00:00
Michele Dolfi
78347bf679
feat: allow computing page images on-demand with scale and cache them ( #36 )
...
* feat: allow computing page images on-demand and cache them
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: expose scale for export of page images and document elements
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix comment
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-20 13:27:19 +02:00
Christoph Auer
c253dd743a
Add redbooks to test data, small additions ( #35 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-20 12:36:00 +02:00
Michele Dolfi
a13114bafd
docs: add technical paper ref ( #37 )
...
* docs: add technical paper ref
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* use techreport bibtex type
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
---------
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-08-20 12:32:53 +02:00
github-actions[bot]
778e51ef18
chore: bump version to 1.4.0 [skip ci]
2024-08-14 11:46:55 +00:00
Michele Dolfi
349b0e914f
fix: allow newer torch versions ( #34 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 13:37:36 +02:00
Michele Dolfi
90dd676422
feat: update parser with bytesio interface and set as new default backend ( #32 )
...
* update parser with bytesio interface
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* change default backend
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update DEFAULT_BACKEND
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-14 12:30:00 +02:00
Christoph Auer
61be78a875
Fix class re-mapping for table of contents ( #33 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-08-14 11:32:30 +02:00
github-actions[bot]
dd0df9f094
chore: bump version to 1.3.0 [skip ci]
2024-08-12 16:29:05 +00:00
Michele Dolfi
63d80edca2
feat: output page images and extracted bbox ( #31 )
...
* Add assemble options and example saving pages and figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add options for different page elements, improve example and flip name of assemble_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-12 18:25:45 +02:00
github-actions[bot]
0bf4a43ed5
chore: bump version to 1.2.1 [skip ci]
2024-08-07 15:38:00 +00:00
Michele Dolfi
79ef8d2f2f
fix: update (vuln) deps ( #29 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:29:36 +02:00
Michele Dolfi
794b20a50a
fix: type of path_or_stream in PdfDocumentBackend ( #28 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:20:44 +02:00
Michele Dolfi
9550db8e64
docs: improve examples ( #27 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:16:35 +02:00
github-actions[bot]
20cbe7c24a
chore: bump version to 1.2.0 [skip ci]
2024-08-07 14:35:03 +00:00
Maxim Lysak
b8f5e38a8c
feat: introducing docling_backend ( #26 )
...
Uses our own docling_parse to reliably get PDF cells
To get page images, this backend uses pypdfium2
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-08-07 16:22:36 +02:00
github-actions[bot]
62ba4aaf31
chore: bump version to 1.1.2 [skip ci]
2024-07-31 12:35:59 +00:00
Panos Vagenas
d2d9543415
fix: set page number using 1-based indexing ( #22 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-31 14:28:44 +02:00
github-actions[bot]
e102827753
chore: bump version to 1.1.1 [skip ci]
2024-07-30 12:53:54 +00:00
Maxim Lysak
f4bf3d25b9
fix: Correct text extraction for table cells ( #21 )
...
* - Fixes the scaling transformation of table cell bounding boxes when using do_cell_matching = False
- Corrected examples/convert.py to use the appropriate parameter, for a good-quality example conversion
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
* Completed checks
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-07-30 14:51:47 +02:00
github-actions[bot]
b07c4a7a4a
chore: bump version to 1.1.0 [skip ci]
2024-07-26 15:01:56 +00:00
Panos Vagenas
d603137383
feat: add simplified single-doc conversion ( #20 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-26 16:55:33 +02:00
mara004
3eca8b8485
refactor(pypdfium2): just forward input to PdfDocument directly ( #17 )
...
PdfDocument() should accept strings, paths, bytes and byte streams. If not, please file a bug report.
Signed-off-by: mara004 <geisserml@gmail.com>
2024-07-25 08:54:57 +02:00
github-actions[bot]
6db2b350dd
chore: bump version to 1.0.2 [skip ci]
2024-07-24 12:18:21 +00:00
Michele Dolfi
54b3dda141
fix: add easyocr to main deps for valid extra ( #19 )
...
* fix: add easyocr to main deps for valid extra
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove group
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 14:11:26 +02:00
github-actions[bot]
3e92f0bfba
chore: bump version to 1.0.1 [skip ci]
2024-07-24 09:28:47 +00:00
Michele Dolfi
b0725e0aa6
fix: expose ocr as extra ( #18 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-24 11:14:17 +02:00
github-actions[bot]
9f2add112f
chore: bump version to 1.0.0 [skip ci]
2024-07-18 15:52:38 +00:00
Michele Dolfi
71c3a9c8cd
feat!: v1.0.0 release ( #16 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:50:14 +02:00
Michele Dolfi
7bc20adc16
pin docling-ibm-models 1.1.0 with python 3.10 support ( #15 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:27:48 +02:00
Panos Vagenas
eb0b208272
chore: switch to docling-core Markdown export ( #14 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 16:10:05 +02:00
Panos Vagenas
28d1c746a6
chore: update README ( #13 )
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 11:23:23 +02:00
github-actions[bot]
f09ffcc8f4
chore: bump version to 0.4.0 [skip ci]
2024-07-17 14:26:50 +00:00
Christoph Auer
e9526bb11e
feat: Optimize table extraction quality, add configuration options ( #11 )
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-17 16:13:21 +02:00
github-actions[bot]
3e2ede8107
chore: bump version to 0.3.1 [skip ci]
2024-07-17 13:58:51 +00:00
Michele Dolfi
d1d1724537
fix: missing type for default values ( #12 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-17 15:54:43 +02:00
Panos Vagenas
2baa35c548
docs: reflect supported Python versions, add badges ( #10 )
...
* docs: reflect supported Python versions, add badges
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* minor HTML fix
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---------
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-17 15:49:26 +02:00
github-actions[bot]
0dfa4548d3
chore: bump version to 0.3.0 [skip ci]
2024-07-17 12:11:15 +00:00