docs: Enrichment models (#1097)
* warning for develop examples
* add docs for enrichment models
* minor reorg of top-level docs (#1098)
* fix typo [no ci]
* trigger ci

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
This commit is contained in: parent b1e79cadc7, commit 357d41cc47
@@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig

 Docling has been brought to you by IBM.

-[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
+[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
 [docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
 [integrations]: https://ds4sd.github.io/docling/integrations/
@@ -1,3 +1,7 @@
+# WARNING
+# This example demonstrates only how to develop a new enrichment model.
+# It does not run the actual formula understanding model.
+
 import logging
 from pathlib import Path
 from typing import Iterable
@@ -1,3 +1,7 @@
+# WARNING
+# This example demonstrates only how to develop a new enrichment model.
+# It does not run the actual picture classifier model.
+
 import logging
 from pathlib import Path
 from typing import Any, Iterable
@@ -149,7 +149,7 @@ This is a collection of FAQ collected from the user questions on <https://github

 **Details**:

-Using the [`HybridChunker`](./concepts/chunking.md#hybrid-chunker) often triggers a warning like this:
+Using the [`HybridChunker`](../concepts/chunking.md#hybrid-chunker) often triggers a warning like this:

 > Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors

 This is a warning that is emitted by transformers, saying that actually *running this sequence through the model* will result in indexing errors, i.e. the problematic case is only if one indeed passes the particular sequence through the (embedding) model.
@@ -47,6 +47,6 @@ Docling simplifies document processing, parsing diverse formats — including ad

 Docling has been brought to you by IBM.

-[supported_formats]: ./supported_formats.md
+[supported_formats]: ./usage/supported_formats.md
 [docling_document]: ./concepts/docling_document.md
 [integrations]: ./integrations/index.md
216
docs/usage/enrichments.md
Normal file
@@ -0,0 +1,216 @@
Docling allows you to enrich the conversion pipeline with additional steps that process specific document components,
e.g. code blocks, pictures, etc. The extra steps usually require the execution of additional models, which may increase
the processing time considerably. For this reason most enrichment models are disabled by default.

The following table provides an overview of the default enrichment models available in Docling.

| Feature | Parameter | Processed item | Description |
| ------- | --------- | -------------- | ----------- |
| Code understanding | `do_code_enrichment` | `CodeItem` | See [docs below](#code-understanding). |
| Formula understanding | `do_formula_enrichment` | `TextItem` with label `FORMULA` | See [docs below](#formula-understanding). |
| Picture classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
| Picture description | `do_picture_description` | `PictureItem` | See [docs below](#picture-description). |

## Enrichment details

### Code understanding

The code understanding step applies advanced parsing to the code blocks found in the document.
This enrichment model also sets the `code_language` property of the `CodeItem`.

Model specs: see the [`CodeFormula` model card](https://huggingface.co/ds4sd/CodeFormula).

Example command line:

```sh
docling --enrich-code FILE
```

Example code:

```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
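
Once conversion completes, the enriched code blocks travel with the document through the export functions. As a small, Docling-free illustration (the helper name and regex below are ours, not part of the Docling API), fenced code blocks and their language tags could be pulled back out of a Markdown export such as the one produced by `doc.export_to_markdown()` like so:

```python
import re

def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, body) pairs for each fenced code block."""
    pattern = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)
    return [
        (lang or "unknown", body.rstrip("\n"))
        for lang, body in pattern.findall(markdown)
    ]

sample = "Some text.\n\n```python\nprint('hello')\n```\n\nMore text.\n"
print(extract_code_blocks(sample))  # [('python', "print('hello')")]
```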

### Formula understanding

The formula understanding step analyzes the formulas in the document and extracts their LaTeX representation.
The HTML export functions of the DoclingDocument leverage this representation and visualize the formulas using MathML syntax.

Model specs: see the [`CodeFormula` model card](https://huggingface.co/ds4sd/CodeFormula).

Example command line:

```sh
docling --enrich-formula FILE
```

Example code:

```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_formula_enrichment = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
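
To make the MathML rendering concrete, below is a toy, self-contained converter for one simple pattern, `\frac{a}{b}`. It is purely illustrative; Docling's actual LaTeX-to-MathML conversion is far more general and is not shown here.

```python
import re

def frac_to_mathml(latex: str) -> str:
    """Toy conversion of a simple \\frac{num}{den} into MathML markup."""
    m = re.fullmatch(r"\\frac\{(\w+)\}\{(\w+)\}", latex.strip())
    if m is None:
        raise ValueError("this sketch only handles \\frac{..}{..}")
    num, den = m.groups()
    return f"<math><mfrac><mi>{num}</mi><mi>{den}</mi></mfrac></math>"

print(frac_to_mathml(r"\frac{a}{b}"))
# <math><mfrac><mi>a</mi><mi>b</mi></mfrac></math>
```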

### Picture classification

The picture classification step classifies the `PictureItem` elements in the document with the `DocumentFigureClassifier` model.
This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams,
logos, signatures, etc.

Model specs: see the [`DocumentFigureClassifier` model card](https://huggingface.co/ds4sd/DocumentFigureClassifier).

Example command line:

```sh
docling --enrich-picture-classes FILE
```

Example code:

```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.images_scale = 2
pipeline_options.do_picture_classification = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
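
The classifier attaches a ranked list of predicted classes with confidence scores to each picture; the exact annotation schema on `PictureItem` lives in `docling-core`. As a Docling-free sketch (the tuples here are a stand-in for that schema), picking the winning label from such a ranking boils down to:

```python
def top_class(predictions: list[tuple[str, float]]) -> str:
    """Return the class name carrying the highest confidence score."""
    name, _confidence = max(predictions, key=lambda pred: pred[1])
    return name

# Hypothetical ranking as the classifier might produce for a chart
predictions = [("bar_chart", 0.08), ("line_chart", 0.87), ("logo", 0.05)]
print(top_class(predictions))  # line_chart
```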

### Picture description

The picture description step allows you to annotate pictures with a vision model. This is also known as a "captioning" task.
The Docling pipeline can load and run models completely locally, as well as connect to remote APIs that support the chat template.
Below are a few examples of how to use common vision models and remote services.

```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
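
The computed descriptions end up as annotations on the corresponding `PictureItem` elements. As a small, Docling-free sketch (the helper and sample data below are hypothetical), turning a collected list of description strings into a readable summary could look like:

```python
def captions_report(captions: list[str]) -> str:
    """Render collected picture descriptions as a Markdown list."""
    return "\n".join(
        f"- Figure {number}: {text}"
        for number, text in enumerate(captions, start=1)
    )

# Hypothetical descriptions as a vision model might return them
captions = ["A bar chart of quarterly revenue.", "The company logo."]
print(captions_report(captions))
```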

#### Granite Vision model

Model specs: see the [`ibm-granite/granite-vision-3.1-2b-preview` model card](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview).

Usage in Docling:

```py
from docling.datamodel.pipeline_options import granite_picture_description

pipeline_options.picture_description_options = granite_picture_description
```

#### SmolVLM model

Model specs: see the [`HuggingFaceTB/SmolVLM-256M-Instruct` model card](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct).

Usage in Docling:

```py
from docling.datamodel.pipeline_options import smolvlm_picture_description

pipeline_options.picture_description_options = smolvlm_picture_description
```

#### Other vision models

The option class `PictureDescriptionVlmOptions` allows you to use any other model from the Hugging Face Hub.

```py
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
    repo_id="",  # <-- add here the Hugging Face repo_id of your favorite VLM
    prompt="Describe the image in three sentences. Be concise and accurate.",
)
```

#### Remote vision model

The option class `PictureDescriptionApiOptions` allows you to use models hosted on remote platforms, e.g.
local endpoints served by [vLLM](https://docs.vllm.ai), [Ollama](https://ollama.com/) and others,
or cloud providers like [IBM watsonx.ai](https://www.ibm.com/products/watsonx-ai), etc.

_Note: in most cases this option will send your data to the remote service provider._

Usage in Docling:

```py
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions

# Enable connections to remote services
pipeline_options.enable_remote_services = True  # <-- this is required!

# Example using a model running locally, e.g. via vLLM
# $ vllm serve MODEL_NAME
pipeline_options.picture_description_options = PictureDescriptionApiOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(
        model="MODEL NAME",
        seed=42,
        max_completion_tokens=200,
    ),
    prompt="Describe the image in three sentences. Be concise and accurate.",
    timeout=90,
)
```
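
Endpoints like the one above typically implement an OpenAI-style chat-completions API. For reference, here is a self-contained sketch of the kind of JSON payload such a service expects, with the prompt and a base64-encoded image in a single user message. The exact payload Docling sends may differ; this shows the common chat-template shape, and the helper function is ours.

```python
import json

def build_chat_payload(model: str, prompt: str, image_b64: str,
                       max_tokens: int = 200) -> str:
    """Assemble an OpenAI-style chat-completions request body."""
    payload = {
        "model": model,
        "max_completion_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }
    return json.dumps(payload)

body = build_chat_payload("MODEL NAME", "Describe the image.", "iVBORw0KGgo=")
print(json.loads(body)["messages"][0]["content"][0]["text"])  # Describe the image.
```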

End-to-end code snippets for cloud providers are available in the examples section:

- [IBM watsonx.ai](../examples/pictures_description_api.py)

## Develop new enrichment models

Besides looking at the implementations of the models listed above, the Docling documentation includes a few examples
dedicated to the development of enrichment models.

- [Develop picture enrichment](../examples/develop_picture_enrichment.py)
- [Develop formula enrichment](../examples/develop_formula_understanding.py)
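
At a high level, the enrichment models in those examples expose two operations: deciding whether a given element should be processed, and enriching a batch of selected elements. A minimal, dependency-free sketch of that shape follows; the method names mirror the linked examples, but treat the exact base class and signatures as assumptions to be checked against the example source.

```python
from types import SimpleNamespace
from typing import Any, Iterable

class DummyEnrichmentModel:
    """Sketch of the two-method shape used by enrichment models."""

    def is_processable(self, doc: Any, element: Any) -> bool:
        # Only elements we know how to enrich are selected.
        return getattr(element, "label", None) == "formula"

    def __call__(self, doc: Any, element_batch: Iterable[Any]) -> Iterable[Any]:
        # Enrich each selected element; a real model would run inference here.
        for element in element_batch:
            element.annotation = "latex-goes-here"
            yield element

doc = None  # stand-in for a DoclingDocument
formula = SimpleNamespace(label="formula")
model = DummyEnrichmentModel()
if model.is_processable(doc, formula):
    enriched = list(model(doc, [formula]))
    print(enriched[0].annotation)  # latex-goes-here
```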

@@ -22,7 +22,7 @@ A simple example would look like this:
 docling https://arxiv.org/pdf/2206.01062
 ```

-To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
+To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).

 ### Advanced options
@@ -104,7 +104,7 @@ The options in this list require the explicit `enable_remote_services=True` when

 #### Adjust pipeline features

-The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
+The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
 one can adjust the conversion pipeline and features.

 ##### Control PDF table extraction options
@@ -183,13 +183,13 @@ You can limit the CPU threads used by Docling by setting the environment variabl

 !!! note

-    This section discusses directly invoking a [backend](./concepts/architecture.md),
+    This section discusses directly invoking a [backend](../concepts/architecture.md),
     i.e. using a low-level API. This should only be done when necessary. For most cases,
     using a `DocumentConverter` (high-level API) as discussed in the sections above
     should suffice — and is the recommended way.

-By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
-You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
+By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](../supported_formats.md)).
+You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
 Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:

 ```python
@@ -214,9 +214,9 @@ print(dl_doc.export_to_markdown())

 ## Chunking

-You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
+You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a
 `HybridChunker`, as shown below (for more details check out
-[this example](examples/hybrid_chunking.ipynb)):
+[this example](../examples/hybrid_chunking.ipynb)):

 ```python
 from docling.document_converter import DocumentConverter
@@ -1,6 +1,6 @@
 Docling can parse various documents formats into a unified representation (Docling
 Document), which it can export to different formats too — check out
-[Architecture](./concepts/architecture.md) for more details.
+[Architecture](../concepts/architecture.md) for more details.

 Below you can find a listing of all supported input and output formats.
@@ -22,7 +22,7 @@ Schema-specific support:
 |--------|-------------|
 | USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
 | JATS XML | XML format followed by [JATS](https://jats.nlm.nih.gov/) articles |
-| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
+| Docling JSON | JSON-serialized [Docling Document](../concepts/docling_document.md) |

 ## Supported output formats
22
mkdocs.yml
@@ -54,11 +54,14 @@ theme:
 nav:
   - Home:
     - "Docling": index.md
-  - Installation: installation.md
-  - Usage: usage.md
-  - Supported formats: supported_formats.md
-  - FAQ: faq.md
-  - Docling v2: v2.md
+  - Installation:
+    - Installation: installation/index.md
+  - Usage:
+    - Usage: usage/index.md
+    - Supported formats: usage/supported_formats.md
+    - Enrichment features: usage/enrichments.md
+  - FAQ:
+    - FAQ: faq/index.md
   - Concepts:
     - Concepts: concepts/index.md
     - Architecture: concepts/architecture.md
@@ -72,11 +75,8 @@ nav:
     - "Batch conversion": examples/batch_convert.py
     - "Multi-format conversion": examples/run_with_formats.py
     - "Figure export": examples/export_figures.py
-    - "Figure enrichment": examples/develop_picture_enrichment.py
     - "Table export": examples/export_tables.py
     - "Multimodal export": examples/export_multimodal.py
-    - "Annotate picture with local vlm": examples/pictures_description.ipynb
-    - "Annotate picture with remote vlm": examples/pictures_description_api.py
     - "Force full page OCR": examples/full_page_ocr.py
     - "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
     - "RapidOCR with custom OCR models": examples/rapidocr_with_custom_models.py
@@ -90,6 +90,12 @@ nav:
     - examples/rag_haystack.ipynb
     - examples/rag_langchain.ipynb
     - examples/rag_llamaindex.ipynb
+  - 🖼️ Picture annotation:
+    - "Annotate picture with local VLM": examples/pictures_description.ipynb
+    - "Annotate picture with remote VLM": examples/pictures_description_api.py
+  - ✨ Enrichment development:
+    - "Figure enrichment": examples/develop_picture_enrichment.py
+    - "Formula enrichment": examples/develop_formula_understanding.py
   - 🗂️ More examples:
     - examples/rag_weaviate.ipynb
     - RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb