docs: extend chunking docs, add FAQ on token limit (#1053)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
parent 1b0ead6907
commit c84b973959
@@ -1,5 +1,18 @@
## Introduction

!!! note "Chunking approaches"

    Starting from a `DoclingDocument`, there are in principle two possible chunking
    approaches:

    1. exporting the `DoclingDocument` to Markdown (or similar format) and then
       performing user-defined chunking as a post-processing step, or
    2. using native Docling chunkers, i.e. operating directly on the `DoclingDocument`.

    This page is about the latter, i.e. using native Docling chunkers.
    For an example of using approach (1), check out e.g.
    [this recipe](../examples/rag_langchain.ipynb) looking at the Markdown export mode.

A *chunker* is a Docling abstraction that, given a
[`DoclingDocument`](./docling_document.md), returns a stream of chunks, each of which
captures some part of the document as a string accompanied by respective metadata.
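
For orientation, here is a minimal sketch of driving such a chunker end to end,
using the `HybridChunker` (the source URL is just an illustrative example):

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# convert a source document into a DoclingDocument
doc = DocumentConverter().convert(source="https://arxiv.org/pdf/2408.09869").document

# chunk it with a native Docling chunker
chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)  # the string content; metadata is available via chunk.meta
```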

docs/faq.md
@@ -132,9 +132,45 @@ This is a collection of FAQ collected from the user questions on <https://github

    ```

??? question "Some images are missing from MS Word and Powerpoint"

    The image processing library used by Docling is able to handle embedded WMF
    images only on the Windows platform.
    If you are on other operating systems, these images will be ignored.

??? question "`HybridChunker` triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model'"

    **TLDR**:
    In the context of the `HybridChunker`, this is a known & anticipated "false alarm".

    **Details**:

    Using the [`HybridChunker`](./concepts/chunking.md#hybrid-chunker) often triggers
    a warning like this:

    > Token indices sequence length is longer than the specified maximum sequence
    > length for this model (530 > 512). Running this sequence through the model will
    > result in indexing errors

    This is a warning emitted by transformers, saying that actually *running this
    sequence through the model* will result in indexing errors, i.e. the problematic
    case would only materialize if one indeed passed the particular sequence through
    the (embedding) model.

    In our case though, this occurs as a "false alarm", since what happens is the
    following:

    - the chunker invokes the tokenizer on a potentially long sequence (e.g.
      530 tokens as mentioned in the warning) in order to count its tokens, i.e. to
      assess if it is short enough; at this point, transformers already emits the
      warning above
    - whenever the sequence at hand is oversized, the chunker proceeds to split it
      (but the transformers warning has already been shown nonetheless)

    What is important is the actual token length of the produced chunks.
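
    For the check below, `chunker`, `tokenizer`, and `chunks` need to be in scope;
    a minimal setup sketch could look like the following (the embedding model ID and
    the previously converted `doc` are illustrative assumptions):

    ```python
    from docling.chunking import HybridChunker
    from transformers import AutoTokenizer

    EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # example model ID

    tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
    chunker = HybridChunker(tokenizer=tokenizer)
    chunks = list(chunker.chunk(dl_doc=doc))  # doc: an already converted DoclingDocument
    ```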

    The snippet below can be used for getting the actual maximum chunk size (for
    users wanting to confirm that this does not exceed the model limit):

    ```python
    max_len = 0
    for i, chunk in enumerate(chunks):
        # serialize the chunk, incl. its metadata-based contextualization
        ser_txt = chunker.serialize(chunk=chunk)
        # count the tokens of the serialized text, without truncation
        ser_tokens = len(tokenizer.tokenize(ser_txt, max_length=None))
        if ser_tokens > max_len:
            max_len = ser_tokens
        print(f"{i}\t{ser_tokens}\t{repr(ser_txt[:100])}...")
    print(f"{max_len=}")
    ```
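
    If one additionally wants to keep the warning out of the logs, one option (a
    sketch, assuming no other transformers warnings are of interest) is lowering the
    transformers logging verbosity:

    ```python
    from transformers import logging as hf_logging

    # suppress transformers warnings (incl. the one above); errors are still shown
    hf_logging.set_verbosity_error()
    ```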

    Source: Issue [docling-core#119](https://github.com/DS4SD/docling-core/issues/119)