diff --git a/docs/concepts/architecture.md b/docs/concepts/architecture.md index e87db06..9138cc7 100644 --- a/docs/concepts/architecture.md +++ b/docs/concepts/architecture.md @@ -10,7 +10,8 @@ For each document format, the *document converter* knows which format-specific * The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation. -Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a [*chunker*](./chunking.md). +Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it serialized by a +[*serializer*](./serialization.md) or chunked by a [*chunker*](./chunking.md). For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869). diff --git a/docs/concepts/chunking.md b/docs/concepts/chunking.md index c552f3a..b554a47 100644 --- a/docs/concepts/chunking.md +++ b/docs/concepts/chunking.md @@ -31,7 +31,7 @@ The `BaseChunker` base class API defines that any chunker should provide the fol - `def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]`: Returning the chunks for the provided document. -- `def serialize(self, chunk: BaseChunk) -> str`: +- `def contextualize(self, chunk: BaseChunk) -> str`: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model). @@ -44,10 +44,14 @@ The `BaseChunker` base class API defines that any chunker should provide the fol from docling.chunking import HybridChunker ``` - If you are only using the `docling-core` package, you must ensure to install - the `chunking` extra, e.g. + the `chunking` extra if you want to use HuggingFace tokenizers, e.g. ```shell pip install 'docling-core[chunking]' ``` + or the `chunking-openai` extra if you prefer Open AI tokenizers (tiktoken), e.g. + ```shell + pip install 'docling-core[chunking-openai]' + ``` and then you can import as follows: ```python diff --git a/docs/concepts/serialization.md b/docs/concepts/serialization.md new file mode 100644 index 0000000..d582056 --- /dev/null +++ b/docs/concepts/serialization.md @@ -0,0 +1,40 @@ +## Introduction + +A *document serializer* (AKA simply *serializer*) is a Docling abstraction that is +initialized with a given [`DoclingDocument`](./docling_document.md) and returns a +textual representation for that document. + +Besides the document serializer, Docling defines similar abstractions for several +document subcomponents, for example: *text serializer*, *table serializer*, +*picture serializer*, *list serializer*, *inline serializer*, and more. + +Last but not least, a *serializer provider* is a wrapper that abstracts the +document serialization strategy from the document instance. + +## Base classes + +To enable both flexibility for downstream applications and out-of-the-box utility, +Docling defines a serialization class hierarchy, providing: + +- base types for the above abstractions: `BaseDocSerializer`, as well as + `BaseTextSerializer`, `BaseTableSerializer` etc, and `BaseSerializerProvider`, and +- specific subclasses for the above-mentioned base types, e.g. `MarkdownDocSerializer`. 
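+
+For orientation, here is a minimal sketch of how one of these concrete serializers can
+be driven end-to-end (it mirrors the example notebook linked further below; the document
+source used here is just an illustration):
+
+```python
+from docling.document_converter import DocumentConverter
+from docling_core.transforms.serializer.markdown import MarkdownDocSerializer
+
+# convert a source document into a DoclingDocument
+doc = DocumentConverter().convert(source="https://arxiv.org/pdf/2311.18481").document
+
+# serialize the full document to Markdown text
+serializer = MarkdownDocSerializer(doc=doc)
+ser_result = serializer.serialize()
+print(ser_result.text)
+```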
+ +You can review all methods required to define the above base classes [here](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/serializer/base.py). + +From a client perspective, the most relevant is `BaseDocSerializer.serialize()`, which +returns the textual representation, as well as relevant metadata on which document +components contributed to that serialization. + +## Use in `DoclingDocument` export methods + +Docling provides predefined serializers for Markdown, HTML, and DocTags. + +The respective `DoclingDocument` export methods (e.g. `export_to_markdown()`) are +provided as user shorthands — internally directly instantiating and delegating to +respective serializers. + +## Examples + +For an example showcasing how to use serializers, see +[here](../examples/serialization.ipynb). diff --git a/docs/examples/hybrid_chunking.ipynb b/docs/examples/hybrid_chunking.ipynb index 68795a0..7acf5af 100644 --- a/docs/examples/hybrid_chunking.ipynb +++ b/docs/examples/hybrid_chunking.ipynb @@ -44,14 +44,7 @@ } ], "source": [ - "%pip install -qU docling transformers" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conversion" + "%pip install -qU pip docling transformers" ] }, { @@ -59,11 +52,32 @@ "execution_count": 2, "metadata": {}, "outputs": [], + "source": [ + "DOC_SOURCE = \"../../tests/data/md/wiki.md\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Basic usage" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first convert the document:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], "source": [ "from docling.document_converter import DocumentConverter\n", "\n", - "DOC_SOURCE = \"../../tests/data/md/wiki.md\"\n", - "\n", "doc = DocumentConverter().convert(source=DOC_SOURCE).document" ] }, @@ -71,17 +85,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Chunking\n", - "\n", - "### Basic usage\n", - "\n", - "For a basic usage scenario, we can just instantiate a `HybridChunker`, which will use\n", + "For a basic chunking scenario, we can just instantiate a `HybridChunker`, which will use\n", "the default parameters." 
] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -111,12 +121,12 @@ "metadata": {}, "source": [ "Note that the text you would typically want to embed is the context-enriched one as\n", - "returned by the `serialize()` method:" + "returned by the `contextualize()` method:" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -126,25 +136,25 @@ "=== 0 ===\n", "chunk.text:\n", "'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver…'\n", - "chunker.serialize(chunk):\n", + "chunker.contextualize(chunk):\n", "'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial …'\n", "\n", "=== 1 ===\n", "chunk.text:\n", "'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…'\n", - "chunker.serialize(chunk):\n", + "chunker.contextualize(chunk):\n", "'IBM\\n1910s–1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…'\n", "\n", "=== 2 ===\n", "chunk.text:\n", "'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…'\n", - "chunker.serialize(chunk):\n", + "chunker.contextualize(chunk):\n", "'IBM\\n1910s–1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John …'\n", "\n", "=== 3 ===\n", "chunk.text:\n", "'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n", - "chunker.serialize(chunk):\n", + "chunker.contextualize(chunk):\n", "'IBM\\n1960s–1980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n", "\n" ] @@ -155,8 +165,8 @@ " print(f\"=== {i} ===\")\n", " print(f\"chunk.text:\\n{f'{chunk.text[:300]}…'!r}\")\n", "\n", - " enriched_text = chunker.serialize(chunk=chunk)\n", - " print(f\"chunker.serialize(chunk):\\n{f'{enriched_text[:300]}…'!r}\")\n", + " enriched_text = chunker.contextualize(chunk=chunk)\n", + " print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}…'!r}\")\n", "\n", " print()" ] @@ -165,23 +175,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Advanced usage\n", + "## Configuring tokenization\n", "\n", - "For more control on the chunking, we can parametrize through the `HybridChunker`\n", - "arguments illustrated below.\n", + "For more control on the chunking, we can parametrize tokenization as shown below.\n", "\n", - "Notice how `tokenizer` and `embed_model` further below are single-sourced from\n", - "`EMBED_MODEL_ID`.\n", - "This is important for making sure the chunker and the embedding model are using the same\n", - "tokenizer." + "In a RAG / retrieval context, it is important to make sure that the chunker and\n", + "embedding model are using the same tokenizer.\n", + "\n", + "👉 HuggingFace transformers tokenizers can be used as shown in the following example:" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ + "from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n", "from transformers import AutoTokenizer\n", "\n", "from docling.chunking import HybridChunker\n", @@ -189,11 +199,50 @@ "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", "MAX_TOKENS = 64 # set to a small number for illustrative purposes\n", "\n", - "tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n", + "tokenizer = HuggingFaceTokenizer(\n", + " tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n", + " max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "👉 Alternatively, [OpenAI tokenizers](https://github.com/openai/tiktoken) can be used as shown in the example below (uncomment to use — requires installing `docling-core[chunking-openai]`):" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# import tiktoken\n", "\n", + "# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n", + "\n", + "# tokenizer = OpenAITokenizer(\n", + "# tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"),\n", + "# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers\n", + "# )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now instantiate our chunker:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ "chunker = HybridChunker(\n", - " tokenizer=tokenizer, # instance or model name, defaults to \"sentence-transformers/all-MiniLM-L6-v2\"\n", - " max_tokens=MAX_TOKENS, # optional, 
by default derived from `tokenizer`\n", + " tokenizer=tokenizer,\n", " merge_peers=True, # optional, defaults to True\n", ")\n", "chunk_iter = chunker.chunk(dl_doc=doc)\n", @@ -213,7 +262,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -223,127 +272,127 @@ "=== 0 ===\n", "chunk.text (55 tokens):\n", "'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n", - "chunker.serialize(chunk) (56 tokens):\n", + "chunker.contextualize(chunk) (56 tokens):\n", "'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n", "\n", "=== 1 ===\n", "chunk.text (45 tokens):\n", "'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n", - "chunker.serialize(chunk) (46 tokens):\n", + "chunker.contextualize(chunk) (46 tokens):\n", "'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n", "\n", "=== 2 ===\n", "chunk.text (63 tokens):\n", "'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n", "\n", "=== 3 ===\n", "chunk.text (44 tokens):\n", "\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n", - "chunker.serialize(chunk) (45 tokens):\n", + "chunker.contextualize(chunk) (45 tokens):\n", "\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n", "\n", "=== 4 ===\n", "chunk.text (63 tokens):\n", "'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. 
Since the 1990s,'\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n", "\n", "=== 5 ===\n", "chunk.text (61 tokens):\n", "'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n", - "chunker.serialize(chunk) (62 tokens):\n", + "chunker.contextualize(chunk) (62 tokens):\n", "'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n", "\n", "=== 6 ===\n", "chunk.text (62 tokens):\n", "\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n", - "chunker.serialize(chunk) (63 tokens):\n", + "chunker.contextualize(chunk) (63 tokens):\n", "\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n", "\n", "=== 7 ===\n", "chunk.text (63 tokens):\n", "'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n", "\n", "=== 8 ===\n", "chunk.text (5 tokens):\n", "'Awards.[16]'\n", - "chunker.serialize(chunk) (6 tokens):\n", + "chunker.contextualize(chunk) (6 tokens):\n", "'IBM\\nAwards.[16]'\n", "\n", "=== 9 ===\n", "chunk.text (56 tokens):\n", "'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n", - "chunker.serialize(chunk) (60 tokens):\n", + "chunker.contextualize(chunk) (60 tokens):\n", "'IBM\\n1910s–1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. 
Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n", "\n", "=== 10 ===\n", "chunk.text (60 tokens):\n", "\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "\"IBM\\n1910s–1950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n", "\n", "=== 11 ===\n", "chunk.text (59 tokens):\n", "'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n", - "chunker.serialize(chunk) (63 tokens):\n", + "chunker.contextualize(chunk) (63 tokens):\n", "'IBM\\n1910s–1950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n", "\n", "=== 12 ===\n", "chunk.text (13 tokens):\n", "'D.C.; and Toronto, Canada.[22]'\n", - "chunker.serialize(chunk) (17 tokens):\n", + "chunker.contextualize(chunk) (17 tokens):\n", "'IBM\\n1910s–1950s\\nD.C.; and Toronto, Canada.[22]'\n", "\n", "=== 13 ===\n", "chunk.text (60 tokens):\n", "'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "'IBM\\n1910s–1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n", "\n", "=== 14 ===\n", "chunk.text (59 tokens):\n", "\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n", - "chunker.serialize(chunk) (63 tokens):\n", + "chunker.contextualize(chunk) (63 tokens):\n", "\"IBM\\n1910s–1950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n", "\n", "=== 15 ===\n", "chunk.text (23 tokens):\n", "\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n", - "chunker.serialize(chunk) (27 tokens):\n", + "chunker.contextualize(chunk) (27 tokens):\n", "\"IBM\\n1910s–1950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n", "\n", "=== 16 ===\n", "chunk.text (59 tokens):\n", "'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n", - "chunker.serialize(chunk) (63 tokens):\n", + "chunker.contextualize(chunk) (63 tokens):\n", "'IBM\\n1910s–1950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n", "\n", "=== 17 ===\n", "chunk.text (60 tokens):\n", "'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n", - "chunker.serialize(chunk) (64 tokens):\n", + "chunker.contextualize(chunk) (64 tokens):\n", "'IBM\\n1910s–1950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n", "\n", "=== 18 ===\n", "chunk.text (57 tokens):\n", "'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n", - "chunker.serialize(chunk) (61 tokens):\n", + "chunker.contextualize(chunk) (61 tokens):\n", "'IBM\\n1910s–1950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n", "\n", "=== 19 ===\n", "chunk.text (21 tokens):\n", "'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n", - "chunker.serialize(chunk) (25 tokens):\n", + "chunker.contextualize(chunk) (25 tokens):\n", "'IBM\\n1910s–1950s\\n1924.[28] By 1933, most of 
the subsidiaries had been merged into one company, IBM.'\n", "\n", "=== 20 ===\n", "chunk.text (22 tokens):\n", "'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\n", - "chunker.serialize(chunk) (26 tokens):\n", + "chunker.contextualize(chunk) (26 tokens):\n", "'IBM\\n1960s–1980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\n", "\n" ] @@ -352,15 +401,32 @@ "source": [ "for i, chunk in enumerate(chunks):\n", " print(f\"=== {i} ===\")\n", - " txt_tokens = len(tokenizer.tokenize(chunk.text))\n", + " txt_tokens = tokenizer.count_tokens(chunk.text)\n", " print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\")\n", "\n", - " ser_txt = chunker.serialize(chunk=chunk)\n", - " ser_tokens = len(tokenizer.tokenize(ser_txt))\n", - " print(f\"chunker.serialize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n", + " ser_txt = chunker.contextualize(chunk=chunk)\n", + " ser_tokens = tokenizer.count_tokens(ser_txt)\n", + " print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n", "\n", " print()" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configuring serialization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can additionally customize the serialization strategy via a user-provided\n", + "[serializer provider](../../concepts/serialization).\n", + "\n", + "For usage examples check out [this notebook](https://github.com/docling-project/docling-core/blob/main/examples/chunking_and_serialization.ipynb)." + ] } ], "metadata": { @@ -379,7 +445,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.7" + "version": "3.13.2" } }, "nbformat": 4, diff --git a/docs/examples/serialization.ipynb b/docs/examples/serialization.ipynb new file mode 100644 index 0000000..26e5c55 --- /dev/null +++ b/docs/examples/serialization.ipynb @@ -0,0 +1,665 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Serialization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook we showcase the usage of Docling [serializers](../../concepts/serialization)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -qU pip docling docling-core~=2.29 rich" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n", + "\n", + "# we set some start-stop cues for defining an excerpt to print\n", + "start_cue = \"Copyright © 2024\"\n", + "stop_cue = \"Application of NLP to ESG\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from rich.console import Console\n", + "from rich.panel import Panel\n", + "\n", + "console = Console(width=210) # for preventing Markdown table wrapped rendering\n", + "\n", + "\n", + "def print_in_console(text):\n", + " console.print(Panel(text))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Basic usage" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We first convert the document:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n", + " warnings.warn(warn_msg)\n" + ] + } + ], + "source": [ + "from docling.document_converter import DocumentConverter\n", + "\n", + "converter = DocumentConverter()\n", + "doc = converter.convert(source=DOC_SOURCE).document" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now apply any `BaseDocSerializer` on the produced document.\n", + "\n", + "👉 Note that, to keep the shown output brief, we only print an excerpt.\n", + "\n", + "E.g. below we apply an `HTMLDocSerializer`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p>                                                                                          │\n",
+       "│ <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM         │\n",
+       "│ 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in                    │\n",
+       "│ India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con-         │\n",
+       "│ sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate-    │\n",
+       "│ rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table>                                                       │\n",
+       "│ <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p>                                                                                             │\n",
+       "│ <p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the   │\n",
+       "│ response.</p>                                                                                                                                                                                                  │\n",
+       "│ <h2>Related Work</h2>                                                                                                                                                                                          │\n",
+       "│ <p>The DocQA integrates multiple AI technologies, namely:</p>                                                                                                                                                  │\n",
+       "│ <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric     │\n",
+       "│ layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et │\n",
+       "│ al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .  │\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p>                        │\n",
+       "│ <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure>                                                                            │\n",
+       "│ <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p>                     │\n",
+       "│ <p>                                                                                                                                                                                                            │\n",
+       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", + "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

│\n", + "│
ReportQuestionAnswer
IBM 2022How many hours were spent on employee learning in 2021?22.5 million hours
IBM │\n", + "│ 2022What was the rate of fatalities in 2021?The rate of fatalities in 2021 was 0.0016.
IBM 2022How many full audits were con- ducted in 2022 in │\n", + "│ India?2
Starbucks 2022What is the percentage of women in the Board of Directors?25%
Starbucks 2022What was the total energy con- │\n", + "│ sumption in 2021?According to the table, the total energy consumption in 2021 was 2,491,543 MWh.
Starbucks 2022How much packaging material was made from renewable mate- │\n", + "│ rials?According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.
│\n", + "│

Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.

│\n", + "│

ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the │\n", + "│ response.

│\n", + "│

Related Work

│\n", + "│

The DocQA integrates multiple AI technologies, namely:

│\n", + "│

Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric │\n", + "│ layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et │\n", + "│ al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . │\n", + "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-

│\n", + "│
Figure 1: System architecture: Simplified sketch of document question-answering pipeline.
│\n", + "│

based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).

│\n", + "│

│\n", + "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from docling_core.transforms.serializer.html import HTMLDocSerializer\n", + "\n", + "serializer = HTMLDocSerializer(doc=doc)\n", + "ser_result = serializer.serialize()\n", + "ser_text = ser_result.text\n", + "\n", + "# we here only print an excerpt to keep the output brief:\n", + "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following example, we use a `MarkdownDocSerializer`:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
+       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
+       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
+       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
+       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
+       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
+       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
+       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
+       "│ response.                                                                                                                                                                                                      │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ <!-- image -->                                                                                                                                                                                                 │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", + "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. │\n", + "│ │\n", + "│ | Report | Question | Answer | │\n", + "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| │\n", + "│ | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | │\n", + "│ | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | │\n", + "│ | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | │\n", + "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | │\n", + "│ | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | │\n", + "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | │\n", + "│ │\n", + "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. │\n", + "│ │\n", + "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the │\n", + "│ response. │\n", + "│ │\n", + "│ ## Related Work │\n", + "│ │\n", + "│ The DocQA integrates multiple AI technologies, namely: │\n", + "│ │\n", + "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n", + "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. │\n", + "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . │\n", + "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- │\n", + "│ │\n", + "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline. │\n", + "│ │\n", + "│ │\n", + "│ │\n", + "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). 
│\n", + "│ │\n", + "│ │\n", + "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n", + "\n", + "serializer = MarkdownDocSerializer(doc=doc)\n", + "ser_result = serializer.serialize()\n", + "ser_text = ser_result.text\n", + "\n", + "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Configuring a serializer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now assume we would like to reconfigure the Markdown serialization such that:\n", + "- it uses a different component serializer, e.g. if we'd prefer tables to be printed in a triplet format (which could potentially improve the vector representation compared to Markdown tables)\n", + "- it uses specific user-defined parameters, e.g. if we'd prefer a different image placeholder text than the default one\n", + "\n", + "Check out the following configuration and notice the serialization differences in the output further below:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The     │\n",
+       "│ rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the  │\n",
+       "│ Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption  │\n",
+       "│ in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were │\n",
+       "│ made from recycled or renewable materials in FY22.                                                                                                                                                             │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
+       "│ response.                                                                                                                                                                                                      │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ <!-- demo picture placeholder -->                                                                                                                                                                              │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", + "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. │\n", + "│ │\n", + "│ IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The │\n", + "│ rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the │\n", + "│ Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption │\n", + "│ in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were │\n", + "│ made from recycled or renewable materials in FY22. │\n", + "│ │\n", + "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. │\n", + "│ │\n", + "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the │\n", + "│ response. │\n", + "│ │\n", + "│ ## Related Work │\n", + "│ │\n", + "│ The DocQA integrates multiple AI technologies, namely: │\n", + "│ │\n", + "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n", + "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. │\n", + "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . │\n", + "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- │\n", + "│ │\n", + "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline. │\n", + "│ │\n", + "│ │\n", + "│ │\n", + "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). 
│\n", + "│ │\n", + "│ │\n", + "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer\n", + "from docling_core.transforms.serializer.markdown import MarkdownParams\n", + "\n", + "serializer = MarkdownDocSerializer(\n", + " doc=doc,\n", + " table_serializer=TripletTableSerializer(),\n", + " params=MarkdownParams(\n", + " image_placeholder=\"\",\n", + " # ...\n", + " ),\n", + ")\n", + "ser_result = serializer.serialize()\n", + "ser_text = ser_result.text\n", + "\n", + "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a custom serializer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the examples above, we were able to reuse existing implementations for our desired\n", + "serialization strategy, but let's now assume we want to define a custom serialization\n", + "logic, e.g. we would like picture serialization to include any available picture\n", + "description (captioning) annotations.\n", + "\n", + "To that end, we first need to revisit our conversion and include all pipeline options\n", + "needed for\n", + "[picture description enrichment](https://docling-project.github.io/docling/usage/enrichments/#picture-description)." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n", + " warnings.warn(warn_msg)\n" + ] + } + ], + "source": [ + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import (\n", + " PdfPipelineOptions,\n", + " PictureDescriptionVlmOptions,\n", + ")\n", + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "\n", + "pipeline_options = PdfPipelineOptions(\n", + " do_picture_description=True,\n", + " picture_description_options=PictureDescriptionVlmOptions(\n", + " repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n", + " prompt=\"Describe this picture in three to five sentences. 
Be precise and concise.\",\n", + " ),\n", + " generate_picture_images=True,\n", + " images_scale=2,\n", + ")\n", + "\n", + "converter = DocumentConverter(\n", + " format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n", + ")\n", + "doc = converter.convert(source=DOC_SOURCE).document" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then define our custom picture serializer:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Any, Optional\n", + "\n", + "from docling_core.transforms.serializer.base import (\n", + " BaseDocSerializer,\n", + " SerializationResult,\n", + ")\n", + "from docling_core.transforms.serializer.common import create_ser_result\n", + "from docling_core.transforms.serializer.markdown import (\n", + " MarkdownParams,\n", + " MarkdownPictureSerializer,\n", + ")\n", + "from docling_core.types.doc.document import (\n", + " DoclingDocument,\n", + " ImageRefMode,\n", + " PictureDescriptionData,\n", + " PictureItem,\n", + ")\n", + "from typing_extensions import override\n", + "\n", + "\n", + "class AnnotationPictureSerializer(MarkdownPictureSerializer):\n", + " @override\n", + " def serialize(\n", + " self,\n", + " *,\n", + " item: PictureItem,\n", + " doc_serializer: BaseDocSerializer,\n", + " doc: DoclingDocument,\n", + " separator: Optional[str] = None,\n", + " **kwargs: Any,\n", + " ) -> SerializationResult:\n", + " text_parts: list[str] = []\n", + "\n", + " # reusing the existing result:\n", + " parent_res = super().serialize(\n", + " item=item,\n", + " doc_serializer=doc_serializer,\n", + " doc=doc,\n", + " **kwargs,\n", + " )\n", + " text_parts.append(parent_res.text)\n", + "\n", + " # appending annotations:\n", + " for annotation in item.annotations:\n", + " if isinstance(annotation, PictureDescriptionData):\n", + " text_parts.append(f\"\")\n", + "\n", + " text_res = (separator or \"\\n\").join(text_parts)\n", + " return create_ser_result(text=text_res, span_source=item)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Last but not least, we define a new doc serializer which leverages our custom picture\n", + "serializer.\n", + "\n", + "Notice the picture description annotations in the output below:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
+       "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.                                                                                              │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ | Report         | Question                                                         | Answer                                                                                                          |        │\n",
+       "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|        │\n",
+       "│ | IBM 2022       | How many hours were spent on employee learning in 2021?          | 22.5 million hours                                                                                              |        │\n",
+       "│ | IBM 2022       | What was the rate of fatalities in 2021?                         | The rate of fatalities in 2021 was 0.0016.                                                                      |        │\n",
+       "│ | IBM 2022       | How many full audits were con- ducted in 2022 in India?          | 2                                                                                                               |        │\n",
+       "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors?       | 25%                                                                                                             |        │\n",
+       "│ | Starbucks 2022 | What was the total energy con- sumption in 2021?                 | According to the table, the total energy consumption in 2021 was 2,491,543 MWh.                                 |        │\n",
+       "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |        │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.                                                                                                    │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the      │\n",
+       "│ response.                                                                                                                                                                                                      │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ ## Related Work                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ The DocQA integrates multiple AI technologies, namely:                                                                                                                                                         │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n",
+       "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.    │\n",
+       "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .      │\n",
+       "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline.                                                                                                                      │\n",
+       "│ <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document           │\n",
+       "│ conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response        │\n",
+       "│ generation step involves generating a response from the information retrieval step. -->                                                                                                                        │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).                            │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "│                                                                                                                                                                                                                │\n",
+       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", + "│ Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. │\n", + "│ │\n", + "│ | Report | Question | Answer | │\n", + "│ |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| │\n", + "│ | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | │\n", + "│ | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | │\n", + "│ | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | │\n", + "│ | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | │\n", + "│ | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | │\n", + "│ | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | │\n", + "│ │\n", + "│ Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. │\n", + "│ │\n", + "│ ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the │\n", + "│ response. │\n", + "│ │\n", + "│ ## Related Work │\n", + "│ │\n", + "│ The DocQA integrates multiple AI technologies, namely: │\n", + "│ │\n", + "│ Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout │\n", + "│ analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. │\n", + "│ 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . │\n", + "│ Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- │\n", + "│ │\n", + "│ Figure 1: System architecture: Simplified sketch of document question-answering pipeline. │\n", + "│ │\n", + "│ │\n", + "│ based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). 
│\n", + "│ │\n", + "│ │\n", + "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "serializer = MarkdownDocSerializer(\n", + " doc=doc,\n", + " picture_serializer=AnnotationPictureSerializer(),\n", + " params=MarkdownParams(\n", + " image_mode=ImageRefMode.PLACEHOLDER,\n", + " image_placeholder=\"\",\n", + " ),\n", + ")\n", + "ser_result = serializer.serialize()\n", + "ser_text = ser_result.text\n", + "\n", + "print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/mkdocs.yml b/mkdocs.yml index cff7b4c..eb3ce0c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -66,6 +66,7 @@ nav: - Concepts: concepts/index.md - Architecture: concepts/architecture.md - Docling Document: concepts/docling_document.md + - Serialization: concepts/serialization.md - Chunking: concepts/chunking.md - Plugins: concepts/plugins.md - Examples: @@ -87,6 +88,8 @@ nav: - "Simple translation": examples/translate.py - examples/backend_csv.ipynb - examples/backend_xml_rag.ipynb + - 📤 Serialization: + - examples/serialization.ipynb - ✂️ Chunking: - examples/hybrid_chunking.ipynb - 🤖 RAG with AI dev frameworks: