docs: update chunking usage docs, minor reorg (#550)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent a7df337654
commit d0c9e8e508
@@ -10,7 +10,7 @@ For each document format, the *document converter* knows which format-specific *
 
 The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.
 
-Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a *chunker*.
+Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a [*chunker*](./chunking.md).
 
 For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
 
@@ -1,4 +1,4 @@
-# CLI Reference
+# CLI reference
 
 This page provides documentation for our command line tools.
 
@@ -6,4 +6,4 @@ This page provides documentation for our command line tools.
 :module: docling.cli.main
 :command: click_app
 :prog_name: docling
 :style: table
@@ -22,9 +22,7 @@ A simple example would look like this:
 docling https://arxiv.org/pdf/2206.01062
```
```
 
-To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./cli.md).
+To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
-
-
 ### Advanced options
 
@@ -130,29 +128,37 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 ## Chunking
 
-You can perform a hierarchy-aware chunking of a Docling document as follows:
+You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
+`HybridChunker`, as shown below (for more details check out
+[this example](examples/hybrid_chunking.ipynb)):
 
 ```python
 from docling.document_converter import DocumentConverter
-from docling_core.transforms.chunker import HierarchicalChunker
+from docling.chunking import HybridChunker
 
 conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
 doc = conv_res.document
-chunks = list(HierarchicalChunker().chunk(doc))
 
-print(chunks[30])
+chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed
+chunk_iter = chunker.chunk(doc)
+```
+
+An example chunk would look like this:
+
+```python
+print(list(chunk_iter)[11])
 # {
-# "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
+# "text": "In this paper, we present the DocLayNet dataset. [...]",
 # "meta": {
 # "doc_items": [{
-# "self_ref": "#/texts/40",
+# "self_ref": "#/texts/28",
 # "label": "text",
 # "prov": [{
 # "page_no": 2,
-# "bbox": {"l": 317.06, "t": 325.81, "r": 559.18, "b": 239.97, ...},
-# }]
-# }],
-# "headings": ["2 RELATED WORK"],
+# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
+# }], ...,
+# }, ...],
+# "headings": ["1 INTRODUCTION"],
 # }
 # }
 ```
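The printed chunk in the usage example has a nested shape (`text`, then `meta` with `doc_items` and `headings`). The sketch below shows one way to read those fields, assuming the chunk is available as a plain dictionary with exactly that shape. `nth` and `summarize_chunk` are hypothetical helpers written for this illustration and are not part of Docling. The example dictionary is a hand-written stand-in with abridged values. `nth` also shows how to take a single item from a lazy iterator without first building the full list.

```python
from itertools import islice
from typing import Any, Iterable, Optional


def nth(items: Iterable[Any], index: int) -> Optional[Any]:
    """Return the item at `index` from any iterable, or None if it is too short."""
    return next(islice(items, index, None), None)


def summarize_chunk(chunk: dict) -> dict:
    """Collect headings, source item refs, and page numbers from a chunk dict."""
    meta = chunk.get("meta", {})
    doc_items = meta.get("doc_items", [])
    # A chunk may span several document items, each with several provenance entries.
    pages = sorted(
        {prov["page_no"] for item in doc_items for prov in item.get("prov", [])}
    )
    return {
        "headings": meta.get("headings", []),
        "refs": [item["self_ref"] for item in doc_items],
        "pages": pages,
    }


# Hand-written stand-in that mirrors the fields of the printed example (abridged).
example_chunk = {
    "text": "In this paper, we present the DocLayNet dataset.",
    "meta": {
        "doc_items": [
            {
                "self_ref": "#/texts/28",
                "label": "text",
                "prov": [
                    {
                        "page_no": 2,
                        "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37},
                    }
                ],
            }
        ],
        "headings": ["1 INTRODUCTION"],
    },
}

chunk_stream = iter([example_chunk])  # stands in for a lazy chunk iterator
summary = summarize_chunk(nth(chunk_stream, 0))
print(summary)
```

Because an iterator can only be consumed once, `nth` advances `chunk_stream` past the requested item. If several chunks are needed, it is simpler to materialize the iterator once and index into the resulting list.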
19
mkdocs.yml
@@ -56,7 +56,6 @@ nav:
 - "Docling": index.md
 - Installation: installation.md
 - Usage: usage.md
-- CLI: cli.md
 - FAQ: faq.md
 - Docling v2: v2.md
 - Concepts:
@@ -76,15 +75,12 @@ nav:
 - "Table export": examples/export_tables.py
 - "Multimodal export": examples/export_multimodal.py
 - "Force full page OCR": examples/full_page_ocr.py
+- Chunking:
+- "Hybrid chunking": examples/hybrid_chunking.ipynb
 - RAG / QA:
 - "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
 - "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
 - "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
-- Chunking:
-- "Hybrid chunking": examples/hybrid_chunking.ipynb
-# - Chunking: examples/chunking.md
-# - CLI:
-# - CLI: examples/cli.md
 - Integrations:
 - Integrations: integrations/index.md
 - "🐝 Bee": integrations/bee.md
@@ -99,10 +95,13 @@ nav:
 - "spaCy": integrations/spacy.md
 - "txtai": integrations/txtai.md
 # - "LangChain 🦜🔗": integrations/langchain.md
-- API reference:
-- Document Converter: api_reference/document_converter.md
-- Pipeline options: api_reference/pipeline_options.md
-- Docling Document: api_reference/docling_document.md
+- Reference:
+- Python API:
+- Document Converter: reference/document_converter.md
+- Pipeline options: reference/pipeline_options.md
+- Docling Document: reference/docling_document.md
+- CLI:
+- CLI reference: reference/cli.md
 
 markdown_extensions:
 - pymdownx.superfences