docs: Enrichment models (#1097)

* warning for develop examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for enrichment models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* minor reorg of top-level docs (#1098)

* minor reorg of top-level docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix typo [no ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* trigger ci

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
This commit is contained in:
Michele Dolfi 2025-03-04 14:24:38 +01:00 committed by GitHub
parent b1e79cadc7
commit 357d41cc47
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
10 changed files with 250 additions and 20 deletions

View File

@ -123,6 +123,6 @@ For individual model usage, please refer to the model licenses found in the orig
Docling has been brought to you by IBM.
[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
[integrations]: https://ds4sd.github.io/docling/integrations/

View File

@ -1,3 +1,7 @@
# WARNING
# This example demonstrates only how to develop a new enrichment model.
# It does not run the actual formula understanding model.
import logging
from pathlib import Path
from typing import Iterable

View File

@ -1,3 +1,7 @@
# WARNING
# This example demonstrates only how to develop a new enrichment model.
# It does not run the actual picture classifier model.
import logging
from pathlib import Path
from typing import Any, Iterable

View File

@ -149,7 +149,7 @@ This is a collection of FAQ collected from the user questions on <https://github
**Details**:
Using the [`HybridChunker`](./concepts/chunking.md#hybrid-chunker) often triggers a warning like this:
Using the [`HybridChunker`](../concepts/chunking.md#hybrid-chunker) often triggers a warning like this:
> Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
This is a warning that is emitted by transformers, saying that actually *running this sequence through the model* will result in indexing errors, i.e. the problematic case is only if one indeed passes the particular sequence through the (embedding) model.

View File

@ -47,6 +47,6 @@ Docling simplifies document processing, parsing diverse formats — including ad
Docling has been brought to you by IBM.
[supported_formats]: ./supported_formats.md
[supported_formats]: ./usage/supported_formats.md
[docling_document]: ./concepts/docling_document.md
[integrations]: ./integrations/index.md

216
docs/usage/enrichments.md Normal file
View File

@ -0,0 +1,216 @@
Docling allows to enrich the conversion pipeline with additional steps which process specific document components,
e.g. code blocks, pictures, etc. The extra steps usually require extra models executions which may increase
the processing time consistently. For this reason most enrichment models are disabled by default.
The following table provides an overview of the default enrichment models available in Docling.
| Feature | Parameter | Processed item | Description |
| ------- | --------- | ---------------| ----------- |
| Code understanding | `do_code_enrichment` | `CodeItem` | See [docs below](#code-understanding). |
| Formula understanding | `do_formula_enrichment` | `TextItem` with label `FORMULA` | See [docs below](#formula-understanding). |
| Picrure classification | `do_picture_classification` | `PictureItem` | See [docs below](#picture-classification). |
| Picture description | `do_picture_description` | `PictureItem` | See [docs below](#picture-description). |
## Enrichments details
### Code understanding
The code understanding step allows to use advance parsing for code blocks found in the document.
This enrichment model also set the `code_language` property of the `CodeItem`.
Model specs: see the [`CodeFormula` model card](https://huggingface.co/ds4sd/CodeFormula).
Example command line:
```sh
docling --enrich-code FILE
```
Example code:
```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True
converter = DocumentConverter(format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})
result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
### Formula understanding
The formula understanding step will analize the equation formulas in documents and extract their LaTeX representation.
The HTML export functions in the DoclingDocument will leverage the formula and visualize the result using the mathml html syntax.
Model specs: see the [`CodeFormula` model card](https://huggingface.co/ds4sd/CodeFormula).
Example command line:
```sh
docling --enrich-formula FILE
```
Example code:
```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_formula_enrichment = True
converter = DocumentConverter(format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})
result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
### Picture classification
The picture classification step classifies the `PictureItem` elements in the document with the `DocumentFigureClassifier` model.
This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams,
logos, signatures, etc.
Model specs: see the [`DocumentFigureClassifier` model card](https://huggingface.co/ds4sd/DocumentFigureClassifier).
Example command line:
```sh
docling --enrich-picture-classes FILE
```
Example code:
```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.images_scale = 2
pipeline_options.do_picture_classification = True
converter = DocumentConverter(format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})
result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
### Picture description
The picture description step allows to annotate a picture with a vision model. This is also known as a "captioning" task.
The Docling pipeline allows to load and run models completely locally as well as connecting to remote API which support the chat template.
Below follow a few examples on how to use some common vision model and remote services.
```py
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True
converter = DocumentConverter(format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})
result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document
```
#### Granite Vision model
Model specs: see the [`ibm-granite/granite-vision-3.1-2b-preview` model card](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview).
Usage in Docling:
```py
from docling.datamodel.pipeline_options import granite_picture_description
pipeline_options.picture_description_options = granite_picture_description
```
#### SmolVLM model
Model specs: see the [`HuggingFaceTB/SmolVLM-256M-Instruct` model card](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct).
Usage in Docling:
```py
from docling.datamodel.pipeline_options import smolvlm_picture_description
pipeline_options.picture_description_options = smolvlm_picture_description
```
#### Other vision models
The option class `PictureDescriptionVlmOptions` allows to use any another model from the Hugging Face Hub.
```py
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions
pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
repo_id="", # <-- add here the Hugging Face repo_id of your favorite VLM
prompt="Describe the image in three sentences. Be consise and accurate.",
)
```
#### Remote vision model
The option class `PictureDescriptionApiOptions` allows to use models hosted on remote platforms, e.g.
on local endpoints served by [VLLM](https://docs.vllm.ai), [Ollama](https://ollama.com/) and others,
or cloud providers like [IBM watsonx.ai](https://www.ibm.com/products/watsonx-ai), etc.
_Note: in most cases this option will send your data to the remote service provider._
Usage in Docling:
```py
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions
# Enable connections to remote services
pipeline_options.enable_remote_services=True # <-- this is required!
# Example using a model running locally, e.g. via VLLM
# $ vllm serve MODEL_NAME
pipeline_options.picture_description_options = PictureDescriptionApiOptions(
url="http://localhost:8000/v1/chat/completions",
params=dict(
model="MODEL NAME",
seed=42,
max_completion_tokens=200,
),
prompt="Describe the image in three sentences. Be consise and accurate.",
timeout=90,
)
```
End-to-end code snippets for cloud providers are available in the examples section:
- [IBM watsonx.ai](../examples/pictures_description_api.py)
## Develop new enrichment models
Beside looking at the implementation of all the models listed above, the Docling documentation has a few examples
dedicated to the implementation of enrichment models.
- [Develop picture enrichment](../examples/develop_picture_enrichment.py)
- [Develop formula enrichment](../examples/develop_formula_understanding.py)

View File

@ -22,7 +22,7 @@ A simple example would look like this:
docling https://arxiv.org/pdf/2206.01062
```
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](../reference/cli.md).
### Advanced options
@ -104,7 +104,7 @@ The options in this list require the explicit `enable_remote_services=True` when
#### Adjust pipeline features
The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
one can adjust the conversion pipeline and features.
##### Control PDF table extraction options
@ -183,13 +183,13 @@ You can limit the CPU threads used by Docling by setting the environment variabl
!!! note
This section discusses directly invoking a [backend](./concepts/architecture.md),
This section discusses directly invoking a [backend](../concepts/architecture.md),
i.e. using a low-level API. This should only be done when necessary. For most cases,
using a `DocumentConverter` (high-level API) as discussed in the sections above
should suffice  and is the recommended way.
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](../supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
```python
@ -214,9 +214,9 @@ print(dl_doc.export_to_markdown())
## Chunking
You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
You can chunk a Docling document using a [chunker](../concepts/chunking.md), such as a
`HybridChunker`, as shown below (for more details check out
[this example](examples/hybrid_chunking.ipynb)):
[this example](../examples/hybrid_chunking.ipynb)):
```python
from docling.document_converter import DocumentConverter

View File

@ -1,6 +1,6 @@
Docling can parse various documents formats into a unified representation (Docling
Document), which it can export to different formats too — check out
[Architecture](./concepts/architecture.md) for more details.
[Architecture](../concepts/architecture.md) for more details.
Below you can find a listing of all supported input and output formats.
@ -22,7 +22,7 @@ Schema-specific support:
|--------|-------------|
| USPTO XML | XML format followed by [USPTO](https://www.uspto.gov/patents) patents |
| JATS XML | XML format followed by [JATS](https://jats.nlm.nih.gov/) articles |
| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
| Docling JSON | JSON-serialized [Docling Document](../concepts/docling_document.md) |
## Supported output formats

View File

@ -54,11 +54,14 @@ theme:
nav:
- Home:
- "Docling": index.md
- Installation: installation.md
- Usage: usage.md
- Supported formats: supported_formats.md
- FAQ: faq.md
- Docling v2: v2.md
- Installation:
- Installation: installation/index.md
- Usage:
- Usage: usage/index.md
- Supported formats: usage/supported_formats.md
- Enrichment features: usage/enrichments.md
- FAQ:
- FAQ: faq/index.md
- Concepts:
- Concepts: concepts/index.md
- Architecture: concepts/architecture.md
@ -72,11 +75,8 @@ nav:
- "Batch conversion": examples/batch_convert.py
- "Multi-format conversion": examples/run_with_formats.py
- "Figure export": examples/export_figures.py
- "Figure enrichment": examples/develop_picture_enrichment.py
- "Table export": examples/export_tables.py
- "Multimodal export": examples/export_multimodal.py
- "Annotate picture with local vlm": examples/pictures_description.ipynb
- "Annotate picture with remote vlm": examples/pictures_description_api.py
- "Force full page OCR": examples/full_page_ocr.py
- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
- "RapidOCR with custom OCR models": examples/rapidocr_with_custom_models.py
@ -90,6 +90,12 @@ nav:
- examples/rag_haystack.ipynb
- examples/rag_langchain.ipynb
- examples/rag_llamaindex.ipynb
- 🖼️ Picture annotation:
- "Annotate picture with local VLM": examples/pictures_description.ipynb
- "Annotate picture with remote VLM": examples/pictures_description_api.py
- ✨ Enrichment development:
- "Figure enrichment": examples/develop_picture_enrichment.py
- "Formula enrichment": examples/develop_formula_understanding.py
- 🗂️ More examples:
- examples/rag_weaviate.ipynb
- RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb