Commit Graph

39 Commits

Author SHA1 Message Date
Panos Vagenas
84c46fdeb3
docs: extend integration docs & README (#456)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-28 09:41:21 +01:00
Peter W. J. Staar
2a1d3fd221
chore: update the README (#409)
* chore: update the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update README.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* chore: update the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:28:53 +01:00
Michele Dolfi
97d571af97
chore: add downloads in README, security policy and update ci actions (#401)
* add security policy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deprecated actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add comment about licenses for new dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pypi downloads badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add citation file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-21 13:59:45 +01:00
Michele Dolfi
7b013abcf3
fix: python3.9 support (#396)
* fixes for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse with python3.9 wheels

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 15:21:40 +01:00
Panos Vagenas
a84ec276b0
docs: update badges & credits (#248)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00
Panos Vagenas
5ce02c5c59
docs: add coming-soon section (#235)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:53:02 +01:00
Peter W. J. Staar
94a5290789
chore: update the with input formats and DoclingDocument (#188)
---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-30 15:02:28 +01:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format (#168)
* updated the base-model and added the asciidoc_backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the asciidoc backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Ensure all models work only on valid pages (#158)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* ci: run ci also on forks (#160)


---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* fix: fix legacy doc ref (#162)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* docs: typo fix (#155)

* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: add coverage_threshold to skip OCR for small images (#161)

* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore: bump version to 2.1.0 [skip ci]

* adding tests for asciidocs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working asciidoc parser

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding test_02.asciidoc

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Drafting Markdown backend via Marko library

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* work in progress on MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* md_backend produces docling document with headers, paragraphs, lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improvements in md parsing

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Detecting and assembling tables in markdown in temporary buffers

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added initial docling table support to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned code, improved logging for MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes MyPy requirements, and rest of pre-commit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed example run_md, added origin info to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* working on asciidocs, struggling with ImageRef

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* able to parse the captions and image URIs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update all backends with proper filename in DocumentOrigin

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update to docling-core v2.1.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for MD Backend, to avoid duplicated text inserts into docling doc

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix styling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added support for code blocks and fenced code in MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned prints

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added proper processing of in-line textual elements for MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issues with duplicated paragraphs and incorrect lists in pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issue with group ordering in pptx backend, added debug log into run with formats

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00
Panos Vagenas
b8d2286dd1
chore: various minor docs fixes (#169)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-22 15:29:36 +02:00
ABHISHEK FADAKE
f799e777c1
docs: typo fix (#155)
* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:56:48 +02:00
Maxim Lysak
034a411057
docs: add graphical band in readme (#154)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 18:15:40 +02:00
Michele Dolfi
61c092f445
docs: add use docling (#150)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 (#117)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas
d504432c1e
docs: introduce docs site (#141)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support (#122)
* feat: windows support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add Windows in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Panos Vagenas
c05b692d69
docs: document chunking (#111)
[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice (#90)
* Support tableformer model choice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update datamodel structure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test unit for table options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Ensure import backwards-compatibility for PipelineOptions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust parameters on custom_convert

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Panos Vagenas
f8f2303348
docs: document CLI, minor README revamp (#100)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 09:21:28 +02:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core (#93)
* updated the render_as_doctags with the new arguments from docling-core

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the doctags tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix poetry lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fix formatting problems

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fixed the doctag export in docling/utils/export.py

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* propagate xsize and ysize

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-23 20:12:18 +02:00
Panos Vagenas
53569a1023
docs: showcase RAG with LlamaIndex and LangChain (#71)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-11 15:07:08 +02:00
Panos Vagenas
1051eb9465
chore: update README (#65)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-09 12:03:04 +02:00
Christoph Auer
85b7348846
docs: Mention quackling on README (#58)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-02 12:27:29 +02:00
Michele Dolfi
a8a60d52b1
docs: add instructions for cpu-only installation (#56)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 10:20:21 +02:00
Panos Vagenas
e46a66a176
fix: refine conversion result (#52)
- fields `output` & `assembled` need not be optional
- introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-08-27 11:50:43 +02:00
Michele Dolfi
fe817b11d7
docs: update interface in README (#50)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-26 15:36:39 +02:00
Michele Dolfi
a13114bafd
docs: add technical paper ref (#37)
* docs: add technical paper ref

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* use techreport bibtex type

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-08-20 12:32:53 +02:00
Michele Dolfi
9550db8e64
docs: improve examples (#27)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-07 17:16:35 +02:00
Panos Vagenas
d2d9543415
fix: set page number using 1-based indexing (#22)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-31 14:28:44 +02:00
Panos Vagenas
d603137383
feat: add simplified single-doc conversion (#20)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-26 16:55:33 +02:00
Michele Dolfi
7bc20adc16
pin docling-ibm-models 1.1.0 with python 3.10 support (#15)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-07-18 17:27:48 +02:00
Panos Vagenas
28d1c746a6
chore: update README (#13)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-18 11:23:23 +02:00
Christoph Auer
e9526bb11e
feat: Optimize table extraction quality, add configuration options (#11)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-17 16:13:21 +02:00
Panos Vagenas
2baa35c548
docs: reflect supported Python versions, add badges (#10)
* docs: reflect supported Python versions, add badges

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor HTML fix

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-17 15:49:26 +02:00
Christoph Auer
2803222ee1
docs: Add setup with pypi to Readme (#7)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-07-16 14:15:09 +02:00
Christoph Auer
05ab89f958
doc: More documentation updates (#2)
* Update README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Bump version

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-07-15 14:59:53 +02:00
Christoph Auer
180f70c6e8
docs: Update links, add GH repository to metadata (#1)
* Add repo, absolute URLs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bump version

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-07-15 12:43:05 +02:00
Christoph Auer
e2d996753b
Initial commit
2024-07-15 09:42:42 +02:00