docs: add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Panos Vagenas 2024-11-21 17:23:04 +01:00 committed by GitHub
parent 97d571af97
commit 7a45b92078
9 changed files with 56 additions and 17 deletions


@@ -1 +1 @@
-Use the navigation on the left to browse some core Docling concepts.
+Use the navigation on the left to browse through some core Docling concepts.


@@ -7,13 +7,14 @@
 [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
 [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
-![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
+[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 Docling parses documents and exports them to the desired format with ease and speed.


@@ -0,0 +1,9 @@
+Docling is available as a plugin for [EXAMPLE](https://example.com).
+- 💻 [GitHub][github]
+- 📖 [Docs][docs]
+- 📦 [PyPI][pypi]
+[github]: https://github.com/...
+[docs]: https://...
+[pypi]: https://pypi.org/project/...


@@ -1,13 +1,13 @@
 ## Get started
-Docling is used by the [Data Prep Kit \[↗\]](https://ibm.github.io/data-prep-kit/) open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
+Docling is used by the [Data Prep Kit](https://ibm.github.io/data-prep-kit/) open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
 Below you find the Data Prep Kit modules powered by Docling.
 ## PDF ingestion to Parquet
-- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet)
-- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
+- 💻 [PDF-to-Parquet GitHub](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet)
+- 📖 [PDF-to-Parquet Docs](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
 ## Document chunking
-- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_chunk)
-- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)
+- 💻 [Doc Chunking GitHub](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_chunk)
+- 📖 [Doc Chunking Docs](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)


@@ -0,0 +1,9 @@
+Docling is available as a file conversion method in [DocETL](https://github.com/ucbepic/docetl):
+- 💻 [DocETL GitHub][github]
+- 📖 [DocETL Docs][docs]
+- 📦 [DocETL PyPI][pypi]
+[github]: https://github.com/ucbepic/docetl
+[docs]: https://ucbepic.github.io/docetl/
+[pypi]: https://pypi.org/project/docetl/


@@ -0,0 +1,9 @@
+Docling is available in [Kotaemon](https://cinnamon.github.io/kotaemon/) as the `DoclingReader` loader:
+- 💻 [Kotaemon GitHub][github]
+- 📖 [DoclingReader Docs][docs]
+- ⚙️ [Docling Setup in Kotaemon][setup]
+[github]: https://github.com/Cinnamon/kotaemon
+[docs]: https://cinnamon.github.io/kotaemon/reference/loaders/docling_loader/
+[setup]: https://cinnamon.github.io/kotaemon/development/?h=docling#setup-multimodal-document-parsing-ocr-table-parsing-figure-extraction


@@ -1,8 +1,8 @@
 ## Get started
-Docling is available as an official [LlamaIndex \[↗\]](https://docs.llamaindex.ai/) extension.
+Docling is available as an official [LlamaIndex](https://docs.llamaindex.ai/) extension.
-To get started, check out the [step-by-step guide in LlamaIndex \[↗\]](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/)<!--{target="_blank"}-->.
+To get started, check out the [step-by-step guide in LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/).
 ## Components
@@ -10,16 +10,14 @@ To get started, check out the [step-by-step guide in LlamaIndex \[↗\]](https:/
 Reads document files and uses Docling to populate LlamaIndex `Document` objects — either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).
-- 💻 [GitHub \[↗\]](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-docling)<!--{target="_blank"}-->
-- 📖 [API docs \[↗\]](https://docs.llamaindex.ai/en/stable/api_reference/readers/docling/)<!--{target="_blank"} -->
-- 📦 [PyPI \[↗\]](https://pypi.org/project/llama-index-readers-docling/)<!--{target="_blank"}-->
-- 🦙 [LlamaHub \[↗\]](https://llamahub.ai/l/readers/llama-index-readers-docling)<!--{target="_blank"}-->
+- 💻 [Docling Reader GitHub](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-docling)
+- 📖 [Docling Reader Docs](https://docs.llamaindex.ai/en/stable/api_reference/readers/docling/)
+- 📦 [Docling Reader PyPI](https://pypi.org/project/llama-index-readers-docling/)
 ### Docling Node Parser
 Reads LlamaIndex `Document` objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex `Node` objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.
-- 💻 [GitHub \[↗\]](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/node_parser/llama-index-node-parser-docling)<!--{target="_blank"}-->
-- 📖 [API docs \[↗\]](https://docs.llamaindex.ai/en/stable/api_reference/node_parser/docling/)<!--{target="_blank"} -->
-- 📦 [PyPI \[↗\]](https://pypi.org/project/llama-index-node-parser-docling/)<!--{target="_blank"}-->
-- 🦙 [LlamaHub \[↗\]](https://llamahub.ai/l/node_parser/llama-index-node-parser-docling)<!--{target="_blank"}-->
+- 💻 [Docling Node Parser GitHub](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/node_parser/llama-index-node-parser-docling)
+- 📖 [Docling Node Parser Docs](https://docs.llamaindex.ai/en/stable/api_reference/node_parser/docling/)
+- 📦 [Docling Node Parser PyPI](https://pypi.org/project/llama-index-node-parser-docling/)


@@ -0,0 +1,9 @@
+Docling is available in [spaCy](https://spacy.io/) as the "SpaCy Layout" plugin:
+- 💻 [SpacyLayout GitHub][github]
+- 📖 [SpacyLayout Docs][docs]
+- 📦 [SpacyLayout PyPI][pypi]
+[github]: https://github.com/explosion/spacy-layout
+[docs]: https://github.com/explosion/spacy-layout?tab=readme-ov-file#readme
+[pypi]: https://pypi.org/project/spacy-layout/


@@ -38,6 +38,7 @@ theme:
 - content.code.annotate
 - content.code.copy
 - announce.dismiss
+- navigation.footer
 - navigation.tabs
 - navigation.indexes # <= if set, each "section" can have its own page, if index.md is used
 - navigation.instant
@@ -85,7 +86,10 @@ nav:
 - Integrations:
 - Integrations: integrations/index.md
 - "Data Prep Kit": integrations/data_prep_kit.md
+- "DocETL": integrations/docetl.md
+- "Kotaemon": integrations/kotaemon.md
 - "LlamaIndex 🦙": integrations/llamaindex.md
+- "spaCy": integrations/spacy.md
 # - "LangChain 🦜🔗": integrations/langchain.md
 # - API reference:
 # - API reference: api_reference/index.md