feat!: Docling v2 (#117)

--------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Maxim Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Maxim Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
parent d504432c1e
commit 7d3be0edeb
144 changed files with 15180 additions and 3828 deletions
@@ -0,0 +1,65 @@
+With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a 
+pydantic datatype, which can express several features common to documents, such as:
+
+* Text, Tables, Pictures, and more
+* Document hierarchy with sections and groups
+* Disambiguation between main body and headers, footers (furniture)
+* Layout information (i.e. bounding boxes) for all items, if available
+* Provenance information
+
+It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
+
+# Example document structures
+
+To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
+`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document 
+serialized as YAML, right side shows the corresponding visual parts in MS Word.
+
+## Basic structure
+
+A `DoclingDocument` exposes top-level fields for the document content, organized in two categories. 
+The first category is the _content items_, which are stored in these fields:
+
+- `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
+- `tables`: All tables, type `TableItem`. Can carry structure annotations.
+- `pictures`: All pictures, type `PictureItem`. Can carry structure annotations.
+- `key_value_items`: All key-value items.
+
+All of the above fields are lists and store items inheriting from the `DocItem` type. They can express different
+data structures depending on their type, and reference parents and children through JSON pointers.
+
+The second category is _content structure_, which is encapsualted in:
+
+- `body`: The root node of a tree-structure for the main document body
+- `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
+- `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)
+
+All of the above fields are only storing `NodeItem` instances, which reference children and parents 
+through JSON pointers. 
+
+The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
+in the tree.
+
+Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`). 
+
+![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)
+
+## Grouping
+
+Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as chilrden. The children of
+"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the 
+top-level `groups` field.
+
+![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)
+
+## Tables
+
+TBD
+
+## Pictures
+
+TBD
+
+## Provenance
+
+TBD
@@ -4,12 +4,17 @@ import time
 from pathlib import Path
 from typing import Iterable

+import yaml
+
 from docling.datamodel.base_models import ConversionStatus
-from docling.datamodel.document import ConversionResult, DocumentConversionInput
+from docling.datamodel.document import ConversionResult
 from docling.document_converter import DocumentConverter

 _log = logging.getLogger(__name__)

+USE_V2 = True
+USE_LEGACY = False
+

 def export_documents(
    conv_results: Iterable[ConversionResult],
@@ -26,25 +31,53 @@ def export_documents(
            success_count += 1
            doc_filename = conv_res.input.file.stem

-            # Export Deep Search document JSON format:
-            with (output_dir / f"{doc_filename}.json").open(
-                "w", encoding="utf-8"
-            ) as fp:
-                fp.write(json.dumps(conv_res.render_as_dict()))
+            if USE_V2:
+                # Export Docling document format to JSON:
+                with (output_dir / f"{doc_filename}.json").open("w") as fp:
+                    fp.write(json.dumps(conv_res.document.export_to_dict()))

-            # Export Text format:
-            with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
-                fp.write(conv_res.render_as_text())
+                # Export Docling document format to YAML:
+                with (output_dir / f"{doc_filename}.yaml").open("w") as fp:
+                    fp.write(yaml.safe_dump(conv_res.document.export_to_dict()))

-            # Export Markdown format:
-            with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
-                fp.write(conv_res.render_as_markdown())
+                # Export Docling document format to doctags:
+                with (output_dir / f"{doc_filename}.doctags.txt").open("w") as fp:
+                    fp.write(conv_res.document.export_to_document_tokens())

-            # Export Document Tags format:
-            with (output_dir / f"{doc_filename}.doctags").open(
-                "w", encoding="utf-8"
-            ) as fp:
-                fp.write(conv_res.render_as_doctags())
+                # Export Docling document format to markdown:
+                with (output_dir / f"{doc_filename}.md").open("w") as fp:
+                    fp.write(conv_res.document.export_to_markdown())
+
+                # Export Docling document format to text:
+                with (output_dir / f"{doc_filename}.txt").open("w") as fp:
+                    fp.write(conv_res.document.export_to_markdown(strict_text=True))
+
+            if USE_LEGACY:
+                # Export Deep Search document JSON format:
+                with (output_dir / f"{doc_filename}.legacy.json").open(
+                    "w", encoding="utf-8"
+                ) as fp:
+                    fp.write(json.dumps(conv_res.legacy_document.export_to_dict()))
+
+                # Export Text format:
+                with (output_dir / f"{doc_filename}.legacy.txt").open(
+                    "w", encoding="utf-8"
+                ) as fp:
+                    fp.write(
+                        conv_res.legacy_document.export_to_markdown(strict_text=True)
+                    )
+
+                # Export Markdown format:
+                with (output_dir / f"{doc_filename}.legacy.md").open(
+                    "w", encoding="utf-8"
+                ) as fp:
+                    fp.write(conv_res.legacy_document.export_to_markdown())
+
+                # Export Document Tags format:
+                with (output_dir / f"{doc_filename}.legacy.doctags.txt").open(
+                    "w", encoding="utf-8"
+                ) as fp:
+                    fp.write(conv_res.legacy_document.export_to_doctags())

        elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:
            _log.info(
@@ -77,23 +110,24 @@ def main():
    ]

    # buf = BytesIO(Path("./test/data/2206.01062.pdf").open("rb").read())
-    # docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+    # docs = [DocumentStream(name="my_doc.pdf", stream=buf)]
    # input = DocumentConversionInput.from_streams(docs)

    doc_converter = DocumentConverter()

-    input = DocumentConversionInput.from_paths(input_doc_paths)
-
    start_time = time.time()

-    conv_results = doc_converter.convert(input)
+    conv_results = doc_converter.convert_all(
+        input_doc_paths,
+        raises_on_error=False,  # to let conversion run through all and examine results at the end
+    )
    success_count, partial_success_count, failure_count = export_documents(
-        conv_results, output_dir=Path("./scratch")
+        conv_results, output_dir=Path("scratch")
    )

    end_time = time.time() - start_time

-    _log.info(f"All documents were converted in {end_time:.2f} seconds.")
+    _log.info(f"Document conversion complete in {end_time:.2f} seconds.")

    if failure_count > 0:
        raise RuntimeError(
@@ -2,72 +2,18 @@ import json
 import logging
 import time
 from pathlib import Path
-from typing import Iterable

-from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
-from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
-from docling.datamodel.base_models import ConversionStatus, PipelineOptions
-from docling.datamodel.document import ConversionResult, DocumentConversionInput
-from docling.datamodel.pipeline_options import (
-    TesseractCliOcrOptions,
-    TesseractOcrOptions,
-)
-from docling.document_converter import DocumentConverter
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption

 _log = logging.getLogger(__name__)


-def export_documents(
-    conv_results: Iterable[ConversionResult],
-    output_dir: Path,
-):
-    output_dir.mkdir(parents=True, exist_ok=True)
-
-    success_count = 0
-    failure_count = 0
-
-    for conv_res in conv_results:
-        if conv_res.status == ConversionStatus.SUCCESS:
-            success_count += 1
-            doc_filename = conv_res.input.file.stem
-
-            # Export Deep Search document JSON format:
-            with (output_dir / f"{doc_filename}.json").open(
-                "w", encoding="utf-8"
-            ) as fp:
-                fp.write(json.dumps(conv_res.render_as_dict()))
-
-            # Export Text format:
-            with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
-                fp.write(conv_res.render_as_text())
-
-            # Export Markdown format:
-            with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
-                fp.write(conv_res.render_as_markdown())
-
-            # Export Document Tags format:
-            with (output_dir / f"{doc_filename}.doctags").open(
-                "w", encoding="utf-8"
-            ) as fp:
-                fp.write(conv_res.render_as_doctags())
-
-        else:
-            _log.info(f"Document {conv_res.input.file} failed to convert.")
-            failure_count += 1
-
-    _log.info(
-        f"Processed {success_count + failure_count} docs, of which {failure_count} failed"
-    )
-
-    return success_count, failure_count
-
-
 def main():
    logging.basicConfig(level=logging.INFO)

-    input_doc_paths = [
-        Path("./tests/data/2206.01062.pdf"),
-    ]
+    input_doc_path = Path("./tests/data/2206.01062.pdf")

    ###########################################################################

@@ -101,14 +47,15 @@ def main():

    # Docling Parse without EasyOCR
    # -------------------------
-    pipeline_options = PipelineOptions()
+    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    doc_converter = DocumentConverter(
-        pipeline_options=pipeline_options,
-        pdf_backend=DoclingParseDocumentBackend,
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+        }
    )

    # Docling Parse with EasyOCR
@@ -151,24 +98,32 @@ def main():

    ###########################################################################

-    # Define input files
-    input = DocumentConversionInput.from_paths(input_doc_paths)
-
    start_time = time.time()
-
-    conv_results = doc_converter.convert(input)
-    success_count, failure_count = export_documents(
-        conv_results, output_dir=Path("./scratch")
-    )
-
+    conv_result = doc_converter.convert(input_doc_path)
    end_time = time.time() - start_time

-    _log.info(f"All documents were converted in {end_time:.2f} seconds.")
+    _log.info(f"Document converted in {end_time:.2f} seconds.")

-    if failure_count > 0:
-        raise RuntimeError(
-            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
-        )
+    ## Export results
+    output_dir = Path("scratch")
+    output_dir.mkdir(parents=True, exist_ok=True)
+    doc_filename = conv_result.input.file.stem
+
+    # Export Deep Search document JSON format:
+    with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
+        fp.write(json.dumps(conv_result.document.export_to_dict()))
+
+    # Export Text format:
+    with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
+        fp.write(conv_result.document.export_to_text())
+
+    # Export Markdown format:
+    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
+        fp.write(conv_result.document.export_to_markdown())
+
+    # Export Document Tags format:
+    with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
+        fp.write(conv_result.document.export_to_document_tokens())


 if __name__ == "__main__":
@@ -0,0 +1,100 @@
+import logging
+from pathlib import Path
+from typing import Any, Iterable
+
+from docling_core.types.doc import (
+    DoclingDocument,
+    NodeItem,
+    PictureClassificationClass,
+    PictureClassificationData,
+    PictureItem,
+)
+
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.models.base_model import BaseEnrichmentModel
+from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
+
+
+class ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):
+    do_picture_classifer: bool = True
+
+
+class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):
+
+    def __init__(self, enabled: bool):
+        self.enabled = enabled
+
+    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
+        return self.enabled and isinstance(element, PictureItem)
+
+    def __call__(
+        self, doc: DoclingDocument, element_batch: Iterable[NodeItem]
+    ) -> Iterable[Any]:
+        if not self.enabled:
+            return
+
+        for element in element_batch:
+            assert isinstance(element, PictureItem)
+
+            # uncomment this to interactively visualize the image
+            # element.image.pil_image.show()
+
+            element.annotations.append(
+                PictureClassificationData(
+                    provenance="example_classifier-0.0.1",
+                    predicted_classes=[
+                        PictureClassificationClass(class_name="dummy", confidence=0.42)
+                    ],
+                )
+            )
+
+            yield element
+
+
+class ExamplePictureClassifierPipeline(StandardPdfPipeline):
+
+    def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):
+        super().__init__(pipeline_options)
+        self.pipeline_options: ExamplePictureClassifierPipeline
+
+        self.enrichment_pipe = [
+            ExamplePictureClassifierEnrichmentModel(
+                enabled=pipeline_options.do_picture_classifer
+            )
+        ]
+
+    @classmethod
+    def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions:
+        return ExamplePictureClassifierPipelineOptions()
+
+
+def main():
+    logging.basicConfig(level=logging.INFO)
+
+    input_doc_path = Path("./tests/data/2206.01062.pdf")
+
+    pipeline_options = ExamplePictureClassifierPipelineOptions()
+    pipeline_options.images_scale = 2.0
+    pipeline_options.generate_picture_images = True
+
+    doc_converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(
+                pipeline_cls=ExamplePictureClassifierPipeline,
+                pipeline_options=pipeline_options,
+            )
+        }
+    )
+    result = doc_converter.convert(input_doc_path)
+
+    for element, _level in result.document.iterate_items():
+        if isinstance(element, PictureItem):
+            print(
+                f"The model populated the `data` portion of picture {element.self_ref}:\n{element.annotations}"
+            )
+
+
+if __name__ == "__main__":
+    main()
@@ -1,17 +1,12 @@
 import logging
 import time
 from pathlib import Path
-from typing import Tuple

-from docling.datamodel.base_models import (
-    AssembleOptions,
-    ConversionStatus,
-    FigureElement,
-    PageElement,
-    TableElement,
-)
-from docling.datamodel.document import DocumentConversionInput
-from docling.document_converter import DocumentConverter
+from docling_core.types.doc import PictureItem, TableItem
+
+from docling.datamodel.base_models import FigureElement, InputFormat, Table
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption

 _log = logging.getLogger(__name__)

@@ -21,64 +16,64 @@ IMAGE_RESOLUTION_SCALE = 2.0
 def main():
    logging.basicConfig(level=logging.INFO)

-    input_doc_paths = [
-        Path("./tests/data/2206.01062.pdf"),
-    ]
-    output_dir = Path("./scratch")
-
-    input_files = DocumentConversionInput.from_paths(input_doc_paths)
+    input_doc_path = Path("./tests/data/2206.01062.pdf")
+    output_dir = Path("scratch")

    # Important: For operating with page images, we must keep them, otherwise the DocumentConverter
    # will destroy them for cleaning up memory.
-    # This is done by setting AssembleOptions.images_scale, which also defines the scale of images.
+    # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.
    # scale=1 correspond of a standard 72 DPI image
-    assemble_options = AssembleOptions()
-    assemble_options.images_scale = IMAGE_RESOLUTION_SCALE
+    # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched
+    # with the image field
+    pipeline_options = PdfPipelineOptions()
+    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
+    pipeline_options.generate_page_images = True
+    pipeline_options.generate_table_images = True
+    pipeline_options.generate_picture_images = True

-    doc_converter = DocumentConverter(assemble_options=assemble_options)
+    doc_converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+        }
+    )

    start_time = time.time()

-    conv_results = doc_converter.convert(input_files)
+    conv_res = doc_converter.convert(input_doc_path)

-    success_count = 0
-    failure_count = 0
    output_dir.mkdir(parents=True, exist_ok=True)
-    for conv_res in conv_results:
-        if conv_res.status != ConversionStatus.SUCCESS:
-            _log.info(f"Document {conv_res.input.file} failed to convert.")
-            failure_count += 1
-            continue
+    doc_filename = conv_res.input.file.stem

-        doc_filename = conv_res.input.file.stem
+    # Save page images
+    for page_no, page in conv_res.document.pages.items():
+        page_no = page.page_no
+        page_image_filename = output_dir / f"{doc_filename}-{page_no}.png"
+        with page_image_filename.open("wb") as fp:
+            page.image.pil_image.save(fp, format="PNG")

-        # Export page images
-        for page in conv_res.pages:
-            page_no = page.page_no + 1
-            page_image_filename = output_dir / f"{doc_filename}-{page_no}.png"
-            with page_image_filename.open("wb") as fp:
-                page.image.save(fp, format="PNG")
-
-        # Export figures and tables
-        for element, image in conv_res.render_element_images(
-            element_types=(FigureElement, TableElement)
-        ):
+    # Save images of figures and tables
+    table_counter = 0
+    picture_counter = 0
+    for element, _level in conv_res.document.iterate_items():
+        if isinstance(element, TableItem):
+            table_counter += 1
            element_image_filename = (
-                output_dir / f"{doc_filename}-element-{element.id}.png"
+                output_dir / f"{doc_filename}-table-{table_counter}.png"
            )
            with element_image_filename.open("wb") as fp:
-                image.save(fp, "PNG")
+                element.image.pil_image.save(fp, "PNG")

-        success_count += 1
+        if isinstance(element, PictureItem):
+            picture_counter += 1
+            element_image_filename = (
+                output_dir / f"{doc_filename}-picture-{picture_counter}.png"
+            )
+            with element_image_filename.open("wb") as fp:
+                element.image.pil_image.save(fp, "PNG")

    end_time = time.time() - start_time

-    _log.info(f"All documents were converted in {end_time:.2f} seconds.")
-
-    if failure_count > 0:
-        raise RuntimeError(
-            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
-        )
+    _log.info(f"Document converted and figures exported in {end_time:.2f} seconds.")


 if __name__ == "__main__":
@@ -5,10 +5,11 @@ from pathlib import Path

 import pandas as pd

-from docling.datamodel.base_models import AssembleOptions, ConversionStatus
-from docling.datamodel.document import DocumentConversionInput
-from docling.document_converter import DocumentConverter
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
 from docling.utils.export import generate_multimodal_pages
+from docling.utils.utils import create_hash

 _log = logging.getLogger(__name__)

@@ -18,71 +19,66 @@ IMAGE_RESOLUTION_SCALE = 2.0
 def main():
    logging.basicConfig(level=logging.INFO)

-    input_doc_paths = [
-        Path("./tests/data/2206.01062.pdf"),
-    ]
-    output_dir = Path("./scratch")
-
-    input_files = DocumentConversionInput.from_paths(input_doc_paths)
+    input_doc_path = Path("./tests/data/2206.01062.pdf")
+    output_dir = Path("scratch")

    # Important: For operating with page images, we must keep them, otherwise the DocumentConverter
    # will destroy them for cleaning up memory.
    # This is done by setting AssembleOptions.images_scale, which also defines the scale of images.
    # scale=1 correspond of a standard 72 DPI image
-    assemble_options = AssembleOptions()
-    assemble_options.images_scale = IMAGE_RESOLUTION_SCALE
+    pipeline_options = PdfPipelineOptions()
+    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
+    pipeline_options.generate_page_images = True

-    doc_converter = DocumentConverter(assemble_options=assemble_options)
+    doc_converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+        }
+    )

    start_time = time.time()

-    converted_docs = doc_converter.convert(input_files)
+    conv_res = doc_converter.convert(input_doc_path)

-    success_count = 0
-    failure_count = 0
    output_dir.mkdir(parents=True, exist_ok=True)
-    for doc in converted_docs:
-        if doc.status != ConversionStatus.SUCCESS:
-            _log.info(f"Document {doc.input.file} failed to convert.")
-            failure_count += 1
-            continue

-        rows = []
-        for (
-            content_text,
-            content_md,
-            content_dt,
-            page_cells,
-            page_segments,
-            page,
-        ) in generate_multimodal_pages(doc):
+    rows = []
+    for (
+        content_text,
+        content_md,
+        content_dt,
+        page_cells,
+        page_segments,
+        page,
+    ) in generate_multimodal_pages(conv_res):

-            dpi = page._default_image_scale * 72
+        dpi = page._default_image_scale * 72

-            rows.append(
-                {
-                    "document": doc.input.file.name,
-                    "hash": doc.input.document_hash,
-                    "page_hash": page.page_hash,
-                    "image": {
-                        "width": page.image.width,
-                        "height": page.image.height,
-                        "bytes": page.image.tobytes(),
-                    },
-                    "cells": page_cells,
-                    "contents": content_text,
-                    "contents_md": content_md,
-                    "contents_dt": content_dt,
-                    "segments": page_segments,
-                    "extra": {
-                        "page_num": page.page_no + 1,
-                        "width_in_points": page.size.width,
-                        "height_in_points": page.size.height,
-                        "dpi": dpi,
-                    },
-                }
-            )
-        success_count += 1
+        rows.append(
+            {
+                "document": conv_res.input.file.name,
+                "hash": conv_res.input.document_hash,
+                "page_hash": create_hash(
+                    conv_res.input.document_hash + ":" + str(page.page_no - 1)
+                ),
+                "image": {
+                    "width": page.image.width,
+                    "height": page.image.height,
+                    "bytes": page.image.tobytes(),
+                },
+                "cells": page_cells,
+                "contents": content_text,
+                "contents_md": content_md,
+                "contents_dt": content_dt,
+                "segments": page_segments,
+                "extra": {
+                    "page_num": page.page_no + 1,
+                    "width_in_points": page.size.width,
+                    "height_in_points": page.size.height,
+                    "dpi": dpi,
+                },
+            }
+        )

    # Generate one parquet from all documents
    df = pd.json_normalize(rows)
@@ -92,12 +88,9 @@ def main():

    end_time = time.time() - start_time

-    _log.info(f"All documents were converted in {end_time:.2f} seconds.")
-
-    if failure_count > 0:
-        raise RuntimeError(
-            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
-        )
+    _log.info(
+        f"Document converted and multimodal pages generated in {end_time:.2f} seconds."
+    )

    # This block demonstrates how the file can be opened with the HF datasets library
    # from datasets import Dataset
@@ -1,12 +1,9 @@
 import logging
 import time
 from pathlib import Path
-from typing import Tuple

 import pandas as pd

-from docling.datamodel.base_models import ConversionStatus
-from docling.datamodel.document import DocumentConversionInput
 from docling.document_converter import DocumentConverter

 _log = logging.getLogger(__name__)
@@ -15,59 +12,39 @@ _log = logging.getLogger(__name__)
 def main():
    logging.basicConfig(level=logging.INFO)

-    input_doc_paths = [
-        Path("./tests/data/2206.01062.pdf"),
-    ]
-    output_dir = Path("./scratch")
-
-    input_files = DocumentConversionInput.from_paths(input_doc_paths)
+    input_doc_path = Path("./tests/data/2206.01062.pdf")
+    output_dir = Path("scratch")

    doc_converter = DocumentConverter()

    start_time = time.time()

-    conv_results = doc_converter.convert(input_files)
+    conv_res = doc_converter.convert(input_doc_path)

-    success_count = 0
-    failure_count = 0
    output_dir.mkdir(parents=True, exist_ok=True)
-    for conv_res in conv_results:
-        if conv_res.status != ConversionStatus.SUCCESS:
-            _log.info(f"Document {conv_res.input.file} failed to convert.")
-            failure_count += 1
-            continue

-        doc_filename = conv_res.input.file.stem
+    doc_filename = conv_res.input.file.stem

-        # Export tables
-        for table_ix, table in enumerate(conv_res.output.tables):
-            table_df: pd.DataFrame = table.export_to_dataframe()
-            print(f"## Table {table_ix}")
-            print(table_df.to_markdown())
+    # Export tables
+    for table_ix, table in enumerate(conv_res.document.tables):
+        table_df: pd.DataFrame = table.export_to_dataframe()
+        print(f"## Table {table_ix}")
+        print(table_df.to_markdown())

-            # Save the table as csv
-            element_csv_filename = output_dir / f"{doc_filename}-table-{table_ix+1}.csv"
-            _log.info(f"Saving CSV table to {element_csv_filename}")
-            table_df.to_csv(element_csv_filename)
+        # Save the table as csv
+        element_csv_filename = output_dir / f"{doc_filename}-table-{table_ix+1}.csv"
+        _log.info(f"Saving CSV table to {element_csv_filename}")
+        table_df.to_csv(element_csv_filename)

-            # Save the table as html
-            element_html_filename = (
-                output_dir / f"{doc_filename}-table-{table_ix+1}.html"
-            )
-            _log.info(f"Saving HTML table to {element_html_filename}")
-            with element_html_filename.open("w") as fp:
-                fp.write(table.export_to_html())
-
-        success_count += 1
+        # Save the table as html
+        element_html_filename = output_dir / f"{doc_filename}-table-{table_ix+1}.html"
+        _log.info(f"Saving HTML table to {element_html_filename}")
+        with element_html_filename.open("w") as fp:
+            fp.write(table.export_to_html())

    end_time = time.time() - start_time

-    _log.info(f"All documents were converted in {end_time:.2f} seconds.")
-
-    if failure_count > 0:
-        raise RuntimeError(
-            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
-        )
+    _log.info(f"Document converted and tables exported in {end_time:.2f} seconds.")


 if __name__ == "__main__":
@@ -2,5 +2,9 @@ from docling.document_converter import DocumentConverter

 source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
 converter = DocumentConverter()
-doc = converter.convert_single(source)
-print(doc.render_as_markdown())  # output: ## Docling Technical Report [...]"
+result = converter.convert(source)
+print(
+    result.document.export_to_markdown()
+)  # output: ## Docling Technical Report [...]"
+# if the legacy output is needed, use this version
+# print(result.legacy_document.export_to_markdown())  # output: ## Docling Technical Report [...]"
@@ -49,18 +49,6 @@
    "load_dotenv()"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import warnings\n",
-    "\n",
-    "warnings.filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic|torch\")\n",
-    "warnings.filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")"
-   ]
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -86,54 +74,37 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
-    "from enum import Enum\n",
    "from typing import Iterator\n",
    "\n",
    "from langchain_core.document_loaders import BaseLoader\n",
    "from langchain_core.documents import Document as LCDocument\n",
-    "from pydantic import BaseModel\n",
    "\n",
    "from docling.document_converter import DocumentConverter\n",
    "\n",
-    "\n",
-    "class DocumentMetadata(BaseModel):\n",
-    "    dl_doc_hash: str\n",
-    "    # source: str\n",
-    "\n",
-    "\n",
    "class DoclingPDFLoader(BaseLoader):\n",
-    "    class ParseType(str, Enum):\n",
-    "        MARKDOWN = \"markdown\"\n",
-    "        # JSON = \"json\"\n",
    "\n",
-    "    def __init__(self, file_path: str | list[str], parse_type: ParseType) -> None:\n",
+    "    def __init__(self, file_path: str | list[str]) -> None:\n",
    "        self._file_paths = file_path if isinstance(file_path, list) else [file_path]\n",
-    "        self._parse_type = parse_type\n",
    "        self._converter = DocumentConverter()\n",
    "\n",
    "    def lazy_load(self) -> Iterator[LCDocument]:\n",
    "        for source in self._file_paths:\n",
-    "            dl_doc = self._converter.convert_single(source).output\n",
-    "            match self._parse_type:\n",
-    "                case self.ParseType.MARKDOWN:\n",
-    "                    text = dl_doc.export_to_markdown()\n",
-    "                # case self.ParseType.JSON:\n",
-    "                #     text = dl_doc.model_dump_json()\n",
-    "                case _:\n",
-    "                    raise RuntimeError(\n",
-    "                        f\"Unexpected parse type encountered: {self._parse_type}\"\n",
-    "                    )\n",
-    "            lc_doc = LCDocument(\n",
-    "                page_content=text,\n",
-    "                metadata=DocumentMetadata(\n",
-    "                    dl_doc_hash=dl_doc.file_info.document_hash,\n",
-    "                ).model_dump(),\n",
-    "            )\n",
-    "            yield lc_doc"
+    "            dl_doc = self._converter.convert(source).document\n",
+    "            text = dl_doc.export_to_markdown()\n",
+    "            yield LCDocument(page_content=text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "FILE_PATH = \"https://raw.githubusercontent.com/DS4SD/docling/main/tests/data/2206.01062.pdf\"  # DocLayNet paper"
   ]
  },
  {
@@ -141,37 +112,10 @@
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
-   "source": [
-    "FILE_PATH = \"https://arxiv.org/pdf/2206.01062\"  # DocLayNet paper"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "1b38d07d5fed4618a44ecf261e1e5c44",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
-    "loader = DoclingPDFLoader(\n",
-    "    file_path=FILE_PATH,\n",
-    "    parse_type=DoclingPDFLoader.ParseType.MARKDOWN,\n",
-    ")\n",
+    "loader = DoclingPDFLoader(file_path=FILE_PATH)\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,\n",
    "    chunk_overlap=200,\n",
@@ -187,7 +131,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -204,7 +148,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -223,7 +167,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -232,7 +176,7 @@
    "from langchain_milvus import Milvus\n",
    "\n",
    "MILVUS_URI = os.environ.get(\n",
-    "    \"MILVUS_URL\", f\"{(tmp_dir := TemporaryDirectory()).name}/milvus_demo.db\"\n",
+    "    \"MILVUS_URI\", f\"{(tmp_dir := TemporaryDirectory()).name}/milvus_demo.db\"\n",
    ")\n",
    "\n",
    "vectorstore = Milvus.from_documents(\n",
@@ -252,7 +196,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
@@ -287,7 +231,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -319,16 +263,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'The human annotation of DocLayNet was performed on 80863 pages.\\n\\nExplanation:\\nThe information is found in the paragraph \"DocLayNet contains 80863 PDF pages\" in the context.'"
+       "'- 80,863 pages were human annotated for DocLayNet.'"
      ]
     },
-     "execution_count": 12,
+     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -336,13 +280,6 @@
   "source": [
    "rag_chain.invoke(\"How many pages were human annotated for DocLayNet?\")"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
@@ -4,7 +4,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
@@ -14,6 +14,13 @@
    "# RAG with LlamaIndex 🦙"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> ℹ️ 👉 **The LlamaIndex Docling extension update to Docling v2 is ongoing; in the meanwhile, this notebook is showing current extension output, based on Docling v1.**"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -0,0 +1,76 @@
+import json
+import logging
+from pathlib import Path
+
+import yaml
+
+from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import (
+    DocumentConverter,
+    PdfFormatOption,
+    WordFormatOption,
+)
+from docling.pipeline.simple_pipeline import SimplePipeline
+from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
+
+_log = logging.getLogger(__name__)
+
+
+def main():
+    input_paths = [
+        Path("tests/data/wiki_duck.html"),
+        Path("tests/data/word_sample.docx"),
+        Path("tests/data/lorem_ipsum.docx"),
+        Path("tests/data/powerpoint_sample.pptx"),
+        Path("tests/data/2305.03393v1-pg9-img.png"),
+        Path("tests/data/2206.01062.pdf"),
+    ]
+
+    ## for defaults use:
+    # doc_converter = DocumentConverter()
+
+    ## to customize use:
+
+    doc_converter = (
+        DocumentConverter(  # all of the below is optional, has internal defaults.
+            allowed_formats=[
+                InputFormat.PDF,
+                InputFormat.IMAGE,
+                InputFormat.DOCX,
+                InputFormat.HTML,
+                InputFormat.PPTX,
+            ],  # whitelist formats, non-matching files are ignored.
+            format_options={
+                InputFormat.PDF: PdfFormatOption(
+                    pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
+                ),
+                InputFormat.DOCX: WordFormatOption(
+                    pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
+                ),
+            },
+        )
+    )
+
+    conv_results = doc_converter.convert_all(input_paths)
+
+    for res in conv_results:
+        out_path = Path("scratch")
+        print(
+            f"Document {res.input.file.name} converted."
+            f"\nSaved markdown output to: {str(out_path)}"
+        )
+        # print(res.docdocument.export_to_markdown())
+        # Export Docling document format to markdowndoc:
+        with (out_path / f"{res.input.file.name}.md").open("w") as fp:
+            fp.write(res.document.export_to_markdown())
+
+        with (out_path / f"{res.input.file.name}.json").open("w") as fp:
+            fp.write(json.dumps(res.document.export_to_dict()))
+
+        with (out_path / f"{res.input.file.name}.yaml").open("w") as fp:
+            fp.write(yaml.safe_dump(res.document.export_to_dict()))
+
+
+if __name__ == "__main__":
+    main()
@@ -17,13 +17,13 @@
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)

-Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+Docling parses documents and exports them to the desired format with ease and speed.

 ## Features

-* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
-* 📑 Understands detailed page layout, reading order and recovers table structures
-* 📝 Extracts metadata from the document, such as title, authors, references and language
-* 🔍 Includes OCR support for scanned PDFs
-* 🤖 Integrates easily with LLM app / RAG frameworks like LlamaIndex&nbsp;🦙 & LangChain&nbsp;🦜🔗
-* 💻 Provides a simple and convenient CLI
+* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
+* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
+* 📝 Metadata extraction, including title, authors, references & language
+* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
+* 🔍 OCR support for scanned PDFs
+* 💻 Simple and convenient CLI
@@ -4,6 +4,10 @@ Docling is available as an official LlamaIndex extension!

 To get started, check out the [step-by-step guide in LlamaIndex \[↗\]](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/)<!--{target="_blank"}-->.

+!!! info "Docling v2"
+
+    The LlamaIndex Docling extension update to Docling v2 is ongoing.
+
 ## Components

 ### Docling Reader
@@ -1,7 +1,5 @@
 {% extends "base.html" %}

-{#
 {% block announce %}
-  <p>🎉 Docling is now officially supported in LlamaIndex! <a href="{{ 'integrations/llamaindex/' | url }}">Check it out</a>!</p>
+  <p>🎉 Docling has gone v2! <a href="{{ 'v2' | url }}">Check out</a> what's new and how to get started!</p>
 {% endblock %}
-#}
@@ -0,0 +1,213 @@
+## What's new
+
+Docling v2 introduces several new features:
+
+- Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats 
+- Produces a new, universal document representation which can encapsulate document hierarchy
+- Comes with a fresh new API and CLI
+
+## Changes in Docling v2
+
+### CLI
+
+We updated the command line syntax of Docling v2 to support many formats. Examples are seen below.
+```shell
+# Convert a single file to Markdown (default)
+docling myfile.pdf
+
+# Convert a single file to Markdown and JSON, without OCR
+docling myfile.pdf --to json --to md --no-ocr
+
+# Convert PDF files in input directory to Markdown (default)
+docling ./input/dir --from pdf
+
+# Convert PDF and Word files in input directory to Markdown and JSON
+docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch  
+
+# Convert all supported files in input directory to Markdown, but abort on first error
+docling ./input/dir --output ./scratch --abort-on-error
+
+```
+
+**Notable changes from Docling v1:**
+
+- The standalone switches for different export formats are removed, and replaced with `--from` and `--to` arguments, to define input and output formats respectively.
+- The new `--abort-on-error` will abort any batch conversion as soon an error is encountered
+- The `--backend` option for PDFs was removed
+
+### Setting up a `DocumentConverter`
+
+To accomodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
+You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options 
+per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults 
+will be used for all `allowed_formats`.
+
+Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
+They are provided as format-specific types, such as `PdfFormatOption` or `WordFormatOption`, as seen below.
+
+```python
+from docling.document_converter import DocumentConverter
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import (
+    DocumentConverter,
+    PdfFormatOption,
+    WordFormatOption,
+)
+from docling.pipeline.simple_pipeline import SimplePipeline
+from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
+
+## Default initialization still works as before:
+# doc_converter = DocumentConverter() 
+
+
+# previous `PipelineOptions` is now `PdfPipelineOptions`
+pipeline_options = PdfPipelineOptions()
+pipeline_options.do_ocr = False
+pipeline_options.do_table_structure = True
+#...
+
+## Custom options are now defined per format. 
+doc_converter = (
+    DocumentConverter(  # all of the below is optional, has internal defaults.
+        allowed_formats=[
+            InputFormat.PDF,
+            InputFormat.IMAGE,
+            InputFormat.DOCX,
+            InputFormat.HTML,
+            InputFormat.PPTX,
+        ],  # whitelist formats, non-matching files are ignored.
+        format_options={
+            InputFormat.PDF: PdfFormatOption(
+                pipeline_options=pipeline_options, # pipeline options go here.
+                backend=PyPdfiumDocumentBackend # optional: pick an alternative backend
+            ),
+            InputFormat.DOCX: WordFormatOption(
+                pipeline_cls=SimplePipeline # default for office formats and HTML
+            ),
+        },
+    )
+)
+```
+
+**Note**: If you work only with defaults, all remains the same as in Docling v1.
+
+More options are shown in the following example units:
+
+- [run_with_formats.py](../examples/run_with_formats/)
+- [custom_convert.py](../examples/custom_convert/)
+
+### Converting documents
+
+We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for 
+better semantics. You can now call the conversion directly with a single file, or a list of input files, 
+or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.
+
+* `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
+* `DocumentConverter.convert_all` now converts many files at once (previously `DocumentConverter.convert`).
+
+
+```python
+...
+from docling.datamodel.document import ConversionResult
+## Convert a single file (from URL or local path)
+conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869") # previously `convert_single`
+
+## Convert several files at once:
+
+input_files = [
+    "tests/data/wiki_duck.html",
+    "tests/data/word_sample.docx",
+    "tests/data/lorem_ipsum.docx",
+    "tests/data/powerpoint_sample.pptx",
+    "tests/data/2305.03393v1-pg9-img.png",
+    "tests/data/2206.01062.pdf",
+]
+
+# Directly pass list of files or streams to `convert_all`
+conv_results_iter = doc_converter.convert_all(input_files) # previously `convert_batch`
+
+```
+Through the `raises_on_error` argument, you can also control if the conversion should raise exceptions when first 
+encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status.
+By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).
+
+```python
+...
+conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert_batch`
+
+```
+
+### Access document structures 
+
+We have simplified how you can access and export the converted document data, too. Our universal document representation
+is now available in conversion results as a `DoclingDocument` object.
+`DoclingDocument` provides a neat set of APIs to construct, iterate and export content in the document, as shown below.
+
+```python
+conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869") # previously `convert_single`
+
+## Inspect the converted document:
+conv_result.document.print_element_tree()
+
+## Iterate the elements in reading order, including hierachy level:
+for item, level in conv_result.document.iterate_items:
+    if isinstance(item, TextItem):
+        print(item.text)
+    elif isinstance(item, TableItem):
+        table_df: pd.DataFrame = item.export_to_dataframe()
+        print(table_df.to_markdown())
+    elif ...:
+        #...
+```
+
+**Note**: While it is deprecated, you can _still_ work with the Docling v1 document representation, it is available as:
+```shell
+conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type
+```
+
+## Export into JSON, Markdown, Doctags
+**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
+and are now available on `DoclingDocument` as:
+
+- `DoclingDocument.export_to_dict`
+- `DoclingDocument.export_to_markdown`
+- `DoclingDocument.export_to_document_tokens`
+
+```python
+conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869") # previously `convert_single`
+
+## Export to desired format:
+print(json.dumps(conv_res.document.export_to_dict()))
+print(conv_res.document.export_to_markdown())
+print(conv_res.document.export_to_document_tokens())
+```
+
+**Note**: While it is deprecated, you can _still_ export Docling v1 JSON format. This is available through the same 
+methods as on the `DoclingDocument` type:
+```shell
+## Export legacy document representation to desired format, for v1 compatibility:
+print(json.dumps(conv_res.legacy_document.export_to_dict()))
+print(conv_res.legacy_document.export_to_markdown())
+print(conv_res.legacy_document.export_to_document_tokens())
+```
+
+## Reload a `DoclingDocument` stored as JSON
+
+You can save and reload a `DoclingDocument` to disk in JSON format using the following codes:
+
+```python
+# Save to disk:
+doc: DoclingDocument = conv_res.document # produced from conversion result...
+
+with Path("./doc.json").open("w") as fp:
+    fp.write(json.dumps(doc.export_to_dict())) # use `export_to_dict` to ensure consistency
+
+# Load from disk:
+with Path("./doc.json").open("r") as fp:
+    doc_dict = json.loads(fp.read())
+    doc = DoclingDocument.model_validate(doc_dict) # use standard pydantic API to populate doc
+
+```
+