docs: document chunking (#111)

[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Panos Vagenas 2024-09-27 11:16:04 +02:00 committed by GitHub
parent 6760571fe1
commit c05b692d69
GPG Key ID: B5690EEEBB952194

@@ -207,6 +207,28 @@ results = doc_converter.convert(conv_input)
You can limit the number of CPU threads Docling uses by setting the environment variable `OMP_NUM_THREADS`. By default, Docling uses 4 CPU threads.
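As a minimal sketch, the variable can also be set from Python instead of the shell. This assumes that it has to be in place before the underlying numerical libraries are loaded, so it is set ahead of the first Docling import. The value `2` is only an example.

```python
import os

# Assumption: set the variable before anything from Docling is imported,
# so that the threading runtime sees it when it initializes.
os.environ["OMP_NUM_THREADS"] = "2"  # example value, pick what suits your machine

# Import Docling only after the variable is set, e.g.:
# from docling.document_converter import DocumentConverter

print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```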
### Chunking
You can chunk a Docling document in a hierarchy-aware manner as follows:
```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

# Convert the source PDF into a Docling document
doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").output

# Split the document into chunks that follow its hierarchy
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# >     ChunkWithMetadata(
# >         path='$.main-text[0]',
# >         text='DocLayNet: A Large Human-Annotated Dataset [...]',
# >         page=1,
# >         bbox=[107.30, 672.38, 505.19, 709.08]
# >     ),
# >     [...]
# > ]
```
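Each chunk in the output above carries its text along with `path`, `page`, and `bbox`, so downstream code can group or filter on these fields. The following is a minimal sketch of grouping chunk texts by page. It runs without Docling installed by using a hypothetical `Chunk` dataclass that mirrors only the fields shown above (the real `ChunkWithMetadata` class may differ). The helper `texts_by_page` and the second sample entry are likewise made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Chunk:
    """Hypothetical stand-in with the fields shown in the example output."""
    path: str
    text: str
    page: int
    bbox: List[float]


def texts_by_page(chunks: Iterable[Chunk]) -> Dict[int, List[str]]:
    """Collect chunk texts per page, keeping the original chunk order."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for chunk in chunks:
        pages[chunk.page].append(chunk.text)
    return dict(pages)


# Sample data: the first entry copies the output above, the second is a placeholder.
sample = [
    Chunk("$.main-text[0]", "DocLayNet: A Large Human-Annotated Dataset [...]",
          1, [107.30, 672.38, 505.19, 709.08]),
    Chunk("$.main-text[1]", "placeholder text", 2, [0.0, 0.0, 1.0, 1.0]),
]

for page, texts in texts_by_page(sample).items():
    print(page, texts)
```

With real output, the same loop should work on the `chunks` list directly, provided the chunk objects expose `page` and `text` attributes as the printed representation suggests.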
## Technical report
For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).