docs: document chunking (#111)

[skip ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
commit c05b692d69
parent 6760571fe1
1 changed file with 22 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -207,6 +207,28 @@ results = doc_converter.convert(conv_input)

 You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.

+### Chunking
+
+You can perform a hierarchy-aware chunking of a Docling document as follows:
+
+```python
+from docling.document_converter import DocumentConverter
+from docling_core.transforms.chunker import HierarchicalChunker
+
+doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").output
+chunks = list(HierarchicalChunker().chunk(doc))
+# > [
+# >     ChunkWithMetadata(
+# >         path='$.main-text[0]',
+# >         text='DocLayNet: A Large Human-Annotated Dataset [...]',
+# >         page=1,
+# >         bbox=[107.30, 672.38, 505.19, 709.08]
+# >     ),
+# >     [...]
+# > ]
+```
+
+
 ## Technical report

 For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
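
Note on the Chunking example added in this change: the README shows only the call and a sample of the output, so the following toy sketch illustrates what "hierarchy-aware" chunking can mean. It is not Docling's implementation and does not use `docling` or `docling_core`. The `Chunk` class, the `chunk_by_heading` function, and the dict-based document layout are all hypothetical, chosen only to mirror the `path` and `text` fields visible in the sample output.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for the ChunkWithMetadata records shown in the README."""
    path: str
    text: str
    heading: Optional[str]


def chunk_by_heading(items: List[dict]) -> Iterator[Chunk]:
    """Yield one chunk per paragraph, tagged with the nearest preceding heading.

    Assumption: the document is a flat list of {"type", "text"} dicts.
    A real Docling document has a richer structure.
    """
    heading = None
    for index, item in enumerate(items):
        if item["type"] == "heading":
            # Remember the heading so later paragraphs inherit it as context.
            heading = item["text"]
            continue
        yield Chunk(path=f"$.main-text[{index}]", text=item["text"], heading=heading)


# Toy input, invented for illustration.
doc_items = [
    {"type": "heading", "text": "Introduction"},
    {"type": "paragraph", "text": "Docling converts documents."},
    {"type": "heading", "text": "Chunking"},
    {"type": "paragraph", "text": "Chunks keep their section context."},
]

chunks = list(chunk_by_heading(doc_items))
for chunk in chunks:
    print(chunk)
```

Each paragraph becomes its own chunk and carries the heading it appeared under, which is the kind of structural context a flat, fixed-size splitter would lose.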