diff --git a/docs/concepts/architecture.md b/docs/concepts/architecture.md
index 07aa1b3..00e81db 100644
--- a/docs/concepts/architecture.md
+++ b/docs/concepts/architecture.md
@@ -10,7 +10,7 @@ For each document format, the *document converter* knows which format-specific *
 
 The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.
 
-Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a *chunker*.
+Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a [*chunker*](./chunking.md).
 
 For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
 
diff --git a/docs/cli.md b/docs/reference/cli.md
similarity index 82%
rename from docs/cli.md
rename to docs/reference/cli.md
index 3f67df0..2561226 100644
--- a/docs/cli.md
+++ b/docs/reference/cli.md
@@ -1,4 +1,4 @@
-# CLI Reference
+# CLI reference
 
 This page provides documentation for our command line tools.
 
@@ -6,4 +6,4 @@ This page provides documentation for our command line tools.
     :module: docling.cli.main
     :command: click_app
     :prog_name: docling
-    :style: table
\ No newline at end of file
+    :style: table
diff --git a/docs/api_reference/docling_document.md b/docs/reference/docling_document.md
similarity index 100%
rename from docs/api_reference/docling_document.md
rename to docs/reference/docling_document.md
diff --git a/docs/api_reference/document_converter.md b/docs/reference/document_converter.md
similarity index 100%
rename from docs/api_reference/document_converter.md
rename to docs/reference/document_converter.md
diff --git a/docs/api_reference/pipeline_options.md b/docs/reference/pipeline_options.md
similarity index 100%
rename from docs/api_reference/pipeline_options.md
rename to docs/reference/pipeline_options.md
diff --git a/docs/usage.md b/docs/usage.md
index e7a214a..9a5b555 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -22,9 +22,7 @@ A simple example would look like this:
 docling https://arxiv.org/pdf/2206.01062
 ```
 
-To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./cli.md).
-
-
+To see all available options (export formats etc.) run `docling --help`. More details in the [CLI reference page](./reference/cli.md).
 
 ### Advanced options
 
@@ -130,29 +128,37 @@ You can limit the CPU threads used by Docling by setting the environment variabl
 
 ## Chunking
 
-You can perform a hierarchy-aware chunking of a Docling document as follows:
+You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
+`HybridChunker`, as shown below (for more details check out
+[this example](examples/hybrid_chunking.ipynb)):
 
 ```python
 from docling.document_converter import DocumentConverter
-from docling_core.transforms.chunker import HierarchicalChunker
+from docling.chunking import HybridChunker
 
 conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
 doc = conv_res.document
 
-chunks = list(HierarchicalChunker().chunk(doc))
-print(chunks[30])
+chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5")  # set tokenizer as needed
+chunk_iter = chunker.chunk(doc)
+```
+
+An example chunk would look like this:
+
+```python
+print(list(chunk_iter)[11])
 # {
-#   "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
+#   "text": "In this paper, we present the DocLayNet dataset. [...]",
 #   "meta": {
 #     "doc_items": [{
-#       "self_ref": "#/texts/40",
+#       "self_ref": "#/texts/28",
 #       "label": "text",
 #       "prov": [{
 #         "page_no": 2,
-#         "bbox": {"l": 317.06, "t": 325.81, "r": 559.18, "b": 239.97, ...},
-#       }]
-#     }],
-#     "headings": ["2 RELATED WORK"],
+#         "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
+#       }], ...,
+#     }, ...],
+#     "headings": ["1 INTRODUCTION"],
 #   }
 # }
 ```
diff --git a/mkdocs.yml b/mkdocs.yml
index 81abcc6..6973824 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -56,7 +56,6 @@ nav:
     - "Docling": index.md
     - Installation: installation.md
     - Usage: usage.md
-    - CLI: cli.md
     - FAQ: faq.md
     - Docling v2: v2.md
   - Concepts:
@@ -76,15 +75,12 @@ nav:
       - "Table export": examples/export_tables.py
       - "Multimodal export": examples/export_multimodal.py
       - "Force full page OCR": examples/full_page_ocr.py
+    - Chunking:
+      - "Hybrid chunking": examples/hybrid_chunking.ipynb
     - RAG / QA:
       - "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
       - "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
       - "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
-    - Chunking:
-      - "Hybrid chunking": examples/hybrid_chunking.ipynb
-    # - Chunking: examples/chunking.md
-    # - CLI:
-    #   - CLI: examples/cli.md
   - Integrations:
     - Integrations: integrations/index.md
     - "🐝 Bee": integrations/bee.md
@@ -99,10 +95,13 @@ nav:
     - "spaCy": integrations/spacy.md
     - "txtai": integrations/txtai.md
     # - "LangChain 🦜🔗": integrations/langchain.md
-  - API reference:
-    - Document Converter: api_reference/document_converter.md
-    - Pipeline options: api_reference/pipeline_options.md
-    - Docling Document: api_reference/docling_document.md
+  - Reference:
+    - Python API:
+      - Document Converter: reference/document_converter.md
+      - Pipeline options: reference/pipeline_options.md
+      - Docling Document: reference/docling_document.md
+    - CLI:
+      - CLI reference: reference/cli.md
 
 markdown_extensions:
   - pymdownx.superfences