docs: add integrations, revamp docs (#693)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

parent d49650c54f · commit 2d24faecd9
docs/examples/hybrid_chunking.ipynb

@@ -4,7 +4,30 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Hybrid Chunking"
+"# Hybrid chunking"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Overview"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.\n",
+"\n",
+"For more details, see [here](../../concepts/chunking#hybrid-chunker)."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
 ]
 },
 {
@@ -21,7 +44,7 @@
 }
 ],
 "source": [
-"%pip install -qU 'docling-core[chunking]' sentence-transformers transformers lancedb"
+"%pip install -qU 'docling-core[chunking]' sentence-transformers transformers"
 ]
 },
 {
@@ -48,16 +71,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Chunking"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Notice how `tokenizer` and `embed_model` further below are single-sourced from `EMBED_MODEL_ID`.\n",
+"## Chunking\n",
 "\n",
-"This is important for making sure the chunker and the embedding model are using the same tokenizer."
+"### Basic usage\n",
+"\n",
+"For a basic usage scenario, we can just instantiate a `HybridChunker`, which will use\n",
+"the default parameters."
 ]
 },
 {
@@ -65,20 +84,102 @@
 "execution_count": 3,
 "metadata": {},
 "outputs": [],
+"source": [
+"from docling.chunking import HybridChunker\n",
+"\n",
+"chunker = HybridChunker()\n",
+"chunk_iter = chunker.chunk(dl_doc=doc)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Note that the text you would typically want to embed is the context-enriched one as\n",
+"returned by the `serialize()` method:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"=== 0 ===\n",
+"chunk.text:\n",
+"'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver…'\n",
+"chunker.serialize(chunk):\n",
+"'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial …'\n",
+"\n",
+"=== 1 ===\n",
+"chunk.text:\n",
+"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…'\n",
+"chunker.serialize(chunk):\n",
+"'IBM\\n1910s–1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…'\n",
+"\n",
+"=== 2 ===\n",
+"chunk.text:\n",
+"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…'\n",
+"chunker.serialize(chunk):\n",
+"'IBM\\n1910s–1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John …'\n",
+"\n",
+"=== 3 ===\n",
+"chunk.text:\n",
+"'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n",
+"chunker.serialize(chunk):\n",
+"'IBM\\n1960s–1980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n",
+"\n"
+]
+}
+],
+"source": [
+"for i, chunk in enumerate(chunk_iter):\n",
+"    print(f\"=== {i} ===\")\n",
+"    print(f\"chunk.text:\\n{repr(f'{chunk.text[:300]}…')}\")\n",
+"\n",
+"    enriched_text = chunker.serialize(chunk=chunk)\n",
+"    print(f\"chunker.serialize(chunk):\\n{repr(f'{enriched_text[:300]}…')}\")\n",
+"\n",
+"    print()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Advanced usage\n",
+"\n",
+"For more control on the chunking, we can parametrize through the `HybridChunker`\n",
+"arguments illustrated below.\n",
+"\n",
+"Notice how `tokenizer` and `embed_model` further below are single-sourced from\n",
+"`EMBED_MODEL_ID`.\n",
+"This is important for making sure the chunker and the embedding model are using the same\n",
+"tokenizer."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {},
+"outputs": [],
 "source": [
 "from transformers import AutoTokenizer\n",
 "\n",
 "from docling.chunking import HybridChunker\n",
 "\n",
 "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
-"MAX_TOKENS = 64\n",
+"MAX_TOKENS = 64  # set to a small number for illustrative purposes\n",
 "\n",
 "tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n",
 "\n",
 "chunker = HybridChunker(\n",
-"    tokenizer=tokenizer,  # can also just pass model name instead of tokenizer instance\n",
+"    tokenizer=tokenizer,  # instance or model name, defaults to \"sentence-transformers/all-MiniLM-L6-v2\"\n",
 "    max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer`\n",
-"    # merge_peers=True,  # optional, defaults to True\n",
+"    merge_peers=True,  # optional, defaults to True\n",
 ")\n",
 "chunk_iter = chunker.chunk(dl_doc=doc)\n",
 "chunks = list(chunk_iter)"
@@ -88,7 +189,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Points to notice:\n",
+"Points to notice looking at the output chunks below:\n",
 "- Where possible, we fit the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)\n",
 "- Where neeeded, we stop before the limit, e.g. see cases of 63 as it would otherwise run into a comma (see chunk 6)\n",
 "- Where possible, we merge undersized peer chunks (see chunk 0)\n",
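The 64-token budget those bullets refer to can be checked directly. A small sketch (not part of the commit), reusing the `tokenizer`, `chunker`, and `chunks` names from the advanced-usage cell above:

```python
# count tokens of each context-enriched chunk with the same tokenizer
# the chunker itself uses; per the notes above, all should fit MAX_TOKENS
for chunk in chunks:
    n_tokens = len(tokenizer.tokenize(chunker.serialize(chunk=chunk)))
    assert n_tokens <= 64
```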
@@ -97,7 +198,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": 6,
 "metadata": {},
 "outputs": [
 {
@@ -245,174 +346,6 @@
 "\n",
 "    print()"
 ]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Vector Retrieval"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 5,
-"metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
-"To disable this warning, you can either:\n",
-"\t- Avoid using `tokenizers` before the fork if possible\n",
-"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
-]
-}
-],
-"source": [
-"from sentence_transformers import SentenceTransformer\n",
-"\n",
-"embed_model = SentenceTransformer(EMBED_MODEL_ID)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 6,
-"metadata": {},
-"outputs": [
-{
-"data": {
-"text/html": [
-"<div>\n",
-"<style scoped>\n",
-"    .dataframe tbody tr th:only-of-type {\n",
-"        vertical-align: middle;\n",
-"    }\n",
-"\n",
-"    .dataframe tbody tr th {\n",
-"        vertical-align: top;\n",
-"    }\n",
-"\n",
-"    .dataframe thead th {\n",
-"        text-align: right;\n",
-"    }\n",
-"</style>\n",
-"<table border=\"1\" class=\"dataframe\">\n",
-"  <thead>\n",
-"    <tr style=\"text-align: right;\">\n",
-"      <th></th>\n",
-"      <th>vector</th>\n",
-"      <th>text</th>\n",
-"      <th>headings</th>\n",
-"      <th>captions</th>\n",
-"      <th>_distance</th>\n",
-"    </tr>\n",
-"  </thead>\n",
-"  <tbody>\n",
-"    <tr>\n",
-"      <th>0</th>\n",
-"      <td>[-0.1269039, -0.01948185, -0.07718097, -0.1116...</td>\n",
-"      <td>language, and the UPC barcode. The company has...</td>\n",
-"      <td>[IBM]</td>\n",
-"      <td>None</td>\n",
-"      <td>1.164613</td>\n",
-"    </tr>\n",
-"    <tr>\n",
-"      <th>1</th>\n",
-"      <td>[-0.10198064, 0.0055981805, -0.05095279, -0.13...</td>\n",
-"      <td>IBM originated with several technological inno...</td>\n",
-"      <td>[IBM, 1910s–1950s]</td>\n",
-"      <td>None</td>\n",
-"      <td>1.245144</td>\n",
-"    </tr>\n",
-"    <tr>\n",
-"      <th>2</th>\n",
-"      <td>[-0.057121325, -0.034115084, -0.018113216, -0....</td>\n",
-"      <td>As one of the world's oldest and largest techn...</td>\n",
-"      <td>[IBM]</td>\n",
-"      <td>None</td>\n",
-"      <td>1.355586</td>\n",
-"    </tr>\n",
-"    <tr>\n",
-"      <th>3</th>\n",
-"      <td>[-0.04429054, -0.058111433, -0.009330196, -0.0...</td>\n",
-"      <td>IBM is the largest industrial research organiz...</td>\n",
-"      <td>[IBM]</td>\n",
-"      <td>None</td>\n",
-"      <td>1.398617</td>\n",
-"    </tr>\n",
-"    <tr>\n",
-"      <th>4</th>\n",
-"      <td>[-0.11920792, 0.053496413, -0.042391937, -0.03...</td>\n",
-"      <td>Awards.[16]</td>\n",
-"      <td>[IBM]</td>\n",
-"      <td>None</td>\n",
-"      <td>1.446295</td>\n",
-"    </tr>\n",
-"  </tbody>\n",
-"</table>\n",
-"</div>"
-],
-"text/plain": [
-"                                              vector  \\\n",
-"0  [-0.1269039, -0.01948185, -0.07718097, -0.1116...   \n",
-"1  [-0.10198064, 0.0055981805, -0.05095279, -0.13...   \n",
-"2  [-0.057121325, -0.034115084, -0.018113216, -0....   \n",
-"3  [-0.04429054, -0.058111433, -0.009330196, -0.0...   \n",
-"4  [-0.11920792, 0.053496413, -0.042391937, -0.03...   \n",
-"\n",
-"                                                text            headings  \\\n",
-"0  language, and the UPC barcode. The company has...               [IBM]   \n",
-"1  IBM originated with several technological inno...  [IBM, 1910s–1950s]   \n",
-"2  As one of the world's oldest and largest techn...               [IBM]   \n",
-"3  IBM is the largest industrial research organiz...               [IBM]   \n",
-"4                                        Awards.[16]               [IBM]   \n",
-"\n",
-"  captions  _distance  \n",
-"0     None   1.164613  \n",
-"1     None   1.245144  \n",
-"2     None   1.355586  \n",
-"3     None   1.398617  \n",
-"4     None   1.446295  "
-]
-},
-"execution_count": 6,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
-"source": [
-"from pathlib import Path\n",
-"from tempfile import mkdtemp\n",
-"\n",
-"import lancedb\n",
-"\n",
-"\n",
-"def make_lancedb_index(db_uri, index_name, chunks, embedding_model):\n",
-"    db = lancedb.connect(db_uri)\n",
-"    data = []\n",
-"    for chunk in chunks:\n",
-"        embeddings = embedding_model.encode(chunker.serialize(chunk=chunk))\n",
-"        data_item = {\n",
-"            \"vector\": embeddings,\n",
-"            \"text\": chunk.text,\n",
-"            \"headings\": chunk.meta.headings,\n",
-"            \"captions\": chunk.meta.captions,\n",
-"        }\n",
-"        data.append(data_item)\n",
-"    tbl = db.create_table(index_name, data=data, exist_ok=True)\n",
-"    return tbl\n",
-"\n",
-"\n",
-"db_uri = str(Path(mkdtemp()) / \"docling.db\")\n",
-"index = make_lancedb_index(db_uri, doc.name, chunks, embed_model)\n",
-"\n",
-"sample_query = \"invent\"\n",
-"sample_embedding = embed_model.encode(sample_query)\n",
-"results = index.search(sample_embedding).limit(5)\n",
-"\n",
-"results.to_pandas()"
-]
 }
 ],
 "metadata": {
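Stepping back from the JSON, the notebook's new flow condenses to a few lines. A minimal sketch, assuming `docling` is installed; the input URL is an illustrative placeholder for the document the notebook converts into `doc`:

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# convert any supported document, then chunk it with default parameters
doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
chunker = HybridChunker()

for chunk in chunker.chunk(dl_doc=doc):
    # chunk.text is the raw text; the serialized form adds context such as
    # headings and is what you would typically embed
    print(chunker.serialize(chunk=chunk)[:300], "…")
```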
docs/examples/rag_haystack.ipynb

@@ -14,6 +14,17 @@
 "# RAG with Haystack"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"| Step | Tech | Execution | \n",
+"| --- | --- | --- |\n",
+"| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n",
+"| Vector store | Milvus | 💻 Local |\n",
+"| Gen AI | Hugging Face Inference API | 🌐 Remote | "
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -90,6 +101,7 @@
 "from docling_haystack.converter import ExportType\n",
 "from dotenv import load_dotenv\n",
 "\n",
+"\n",
 "def _get_env_from_colab_or_os(key):\n",
 "    try:\n",
 "        from google.colab import userdata\n",
@@ -102,6 +114,7 @@
 "        pass\n",
 "    return os.getenv(key)\n",
 "\n",
+"\n",
 "load_dotenv()\n",
 "HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\n",
 "PATHS = [\"https://arxiv.org/pdf/2408.09869\"]  # Docling Technical Report\n",
docs/examples/rag_langchain.ipynb

@@ -4,7 +4,25 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# RAG with LangChain 🦜🔗"
+"# RAG with LangChain"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"| Step | Tech | Execution | \n",
+"| --- | --- | --- |\n",
+"| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n",
+"| Vector store | Milvus | 💻 Local |\n",
+"| Gen AI | Hugging Face Inference API | 🌐 Remote |"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
 ]
 },
 {
@@ -49,13 +67,6 @@
 "load_dotenv()"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Setup"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -85,6 +96,7 @@
 "\n",
 "from docling.document_converter import DocumentConverter\n",
 "\n",
+"\n",
 "class DoclingPDFLoader(BaseLoader):\n",
 "\n",
 "    def __init__(self, file_path: str | list[str]) -> None:\n",
@@ -298,7 +310,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.4"
+"version": "3.12.7"
 }
 },
 "nbformat": 4,
docs/examples/rag_llamaindex.ipynb

@@ -11,7 +11,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# RAG with LlamaIndex 🦙"
+"# RAG with LlamaIndex"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"| Step | Tech | Execution | \n",
+"| --- | --- | --- |\n",
+"| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n",
+"| Vector store | Milvus | 💻 Local |\n",
+"| Gen AI | Hugging Face Inference API | 🌐 Remote | "
 ]
 },
 {
@@ -462,7 +473,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.4"
+"version": "3.12.7"
 }
 },
 "nbformat": 4,
docs/examples/rag_weaviate.ipynb

@@ -1,14 +1,33 @@
 {
 "cells": [
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "Ag9kcX2B_atc"
 },
 "source": [
-"# Performing RAG over PDFs with Weaviate and Docling\n",
+"# RAG with Weaviate"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"\n",
+"| Step | Tech | Execution | \n",
+"| --- | --- | --- |\n",
+"| Embedding | Open AI | 🌐 Remote |\n",
+"| Vector store | Weavieate | 💻 Local |\n",
+"| Gen AI | Open AI | 🌐 Remote |\n",
+"\n",
 "## A recipe 🧑‍🍳 🐥 💚\n",
-"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/tree/main/docs/examples/rag_weaviate.ipynb)\n",
 "\n",
 "This is a code recipe that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/).\n",
 "\n",
@@ -711,7 +730,8 @@
 "provenance": []
 },
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": ".venv",
+"language": "python",
 "name": "python3"
 },
 "language_info": {
@@ -724,7 +744,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.10"
+"version": "3.12.7"
 }
 },
 "nbformat": 4,
docs/examples/retrieval_qdrant.ipynb (renamed from hybrid_rag_qdrant.ipynb)

@@ -12,7 +12,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Hybrid RAG with Qdrant"
+"# Retrieval with Qdrant"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"| Step | Tech | Execution | \n",
+"| --- | --- | --- |\n",
+"| Embedding | FastEmbed | 💻 Local |\n",
+"| Vector store | Qdrant | 💻 Local |"
 ]
 },
 {
@@ -47,22 +57,19 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 1,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
 "Note: you may need to restart the kernel to use updated packages.\n"
 ]
 }
 ],
 "source": [
-"%pip install --no-warn-conflicts -q qdrant-client docling docling-core fastembed"
+"%pip install --no-warn-conflicts -q qdrant-client docling fastembed"
 ]
 },
 {
@@ -74,13 +81,13 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": 2,
 "metadata": {},
 "outputs": [],
 "source": [
-"from docling_core.transforms.chunker import HierarchicalChunker\n",
 "from qdrant_client import QdrantClient\n",
 "\n",
+"from docling.chunking import HybridChunker\n",
 "from docling.datamodel.base_models import InputFormat\n",
 "from docling.document_converter import DocumentConverter"
 ]
@@ -95,36 +102,16 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 3,
 "metadata": {},
 "outputs": [
 {
-"data": {
-"application/vnd.jupyter.widget-view+json": {
-"model_id": "c1077c6634d9434584c41cc12f9107c9",
-"version_major": 2,
-"version_minor": 0
-},
-"text/plain": [
-"Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]"
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n",
+"  warnings.warn(\n"
 ]
-},
-"metadata": {},
-"output_type": "display_data"
-},
-{
-"data": {
-"application/vnd.jupyter.widget-view+json": {
-"model_id": "67069c07b73448d491944452159d10bc",
-"version_major": 2,
-"version_minor": 0
-},
-"text/plain": [
-"Fetching 29 files:   0%|          | 0/29 [00:00<?, ?it/s]"
-]
-},
-"metadata": {},
-"output_type": "display_data"
 }
 ],
 "source": [
@@ -149,7 +136,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 4,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -157,7 +144,7 @@
 "    \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n",
 ")\n",
 "documents, metadatas = [], []\n",
-"for chunk in HierarchicalChunker().chunk(result.document):\n",
+"for chunk in HybridChunker().chunk(result.document):\n",
 "    documents.append(chunk.text)\n",
 "    metadatas.append(chunk.meta.export_json_dict())"
 ]
@@ -173,95 +160,119 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": 5,
 "metadata": {},
-"outputs": [
-{
-"data": {
-"text/plain": [
-"['e74ae15be5eb4805858307846318e784',\n",
-" 'f83f6125b0fa4a0595ae6a0777c9d90d',\n",
-" '9cf63c7f30764715bf3804a19db36d7d',\n",
-" '007dbe6d355b4b49af3b736cbd63a4d8',\n",
-" 'e5e31f21f2e84aa68beca0dfc532cbe9',\n",
-" '69c10816af204bb28630a1f957d8dd3e',\n",
-" 'b63546b9b1744063bdb076b234d883ca',\n",
-" '90ad15ba8fa6494489e1d3221e30bfcf',\n",
-" '13517debb483452ea40fc7aa04c08c50',\n",
-" '84ccab5cfab74e27a55acef1c63e3fad',\n",
-" 'e8aa2ef46d234c5a8a9da64b701d60b4',\n",
-" '190bea5ba43c45e792197c50898d1d90',\n",
-" 'a730319ea65645ca81e735ace0bcc72e',\n",
-" '415e7f6f15864e30b836e23ae8d71b43',\n",
-" '5569bce4e65541868c762d149c6f491e',\n",
-" '74d9b234e9c04ebeb8e4e1ca625789ac',\n",
-" '308b1c5006a94a679f4c8d6f2396993c',\n",
-" 'aaa5ec6d385a418388e660c425bf1dbe',\n",
-" '630be8e43e4e4472a9cdb9af9462a43a',\n",
-" '643b316224de4770a5349bf69cf93471',\n",
-" 'da9265e6f6c2485493d15223eefdf411',\n",
-" 'a916e447d52c4084b5ce81a0c5a65b07',\n",
-" '2883c620858e4e728b88e127155a4f2c',\n",
-" '2a998f0e9c124af99027060b94027874',\n",
-" 'be551fbd2b9e42f48ebae0cbf1f481bc',\n",
-" '95b7f7608e974ca6847097ee4590fba1',\n",
-" '309db4f3863b4e3aaf16d5f346c309f3',\n",
-" 'c818383267f64fd68b2237b024bd724e',\n",
-" '1f16e78338c94238892171b400051cd4',\n",
-" '25c680c3e064462cab071ea9bf1bad8c',\n",
-" 'f41ab7e480a248c6bb87019341c7ca74',\n",
-" 'd440128bed6d4dcb987152b48ecd9a8a',\n",
-" 'c110d5dfdc5849808851788c2404dd15']"
-]
-},
-"execution_count": 4,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
+"outputs": [],
 "source": [
-"client.add(COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64)"
+"_ = client.add(\n",
+"    collection_name=COLLECTION_NAME,\n",
+"    documents=documents,\n",
+"    metadata=metadatas,\n",
+"    batch_size=64,\n",
+")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Query Documents"
+"## Retrieval"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 5,
+"execution_count": 6,
+"metadata": {},
+"outputs": [],
+"source": [
+"points = client.query(\n",
+"    collection_name=COLLECTION_NAME,\n",
+"    query_text=\"Can I split documents?\",\n",
+"    limit=10,\n",
+")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 7,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"<=== Retrieved documents ===>\n",
-"Document Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\n",
-"Document Specific Chunking can handle a variety of document formats, such as:\n",
-"Consequently, there are also splitters available for this purpose.\n",
+"=== 0 ===\n",
+"Have you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n",
 "1. We start at the top of the document, treating the first part as a chunk.\n",
 "    2. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n",
 "    3. We keep this up until we reach the end of the document.\n",
-"Have you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n",
-"The goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\n",
-"To put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\n",
-"Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\n",
+"The ultimate dream? Having an agent do this for you. But slow down! This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet. However, Greg Kamradt has his version available here.\n",
+"\n",
+"=== 1 ===\n",
+"Document Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\n",
+"Document Specific Chunking can handle a variety of document formats, such as:\n",
+"Markdown\n",
+"HTML\n",
+"Python\n",
+"etc\n",
+"Here we’ll take Markdown as our example and use a modified version of our first sample text:\n",
+"\n",
+"The result is the following:\n",
 "You can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n",
-"And there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\n"
+"\n",
+"=== 2 ===\n",
+"And there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\n",
+"To put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\n",
+"By the way, if you're eager to experiment with your own examples using the chunking visualisation tool featured in this blog, feel free to give it a try! You can access it right here. Enjoy, and happy chunking! 😉\n",
+"\n",
+"=== 3 ===\n",
+"Retrieval Augmented Generation (RAG) has been a hot topic in understanding, interpreting, and generating text with AI for the last few months. It's like a wonderful union of retrieval-based and generative models, creating a playground for researchers, data scientists, and natural language processing enthusiasts, like you and me.\n",
+"To truly control the results produced by our RAG, we need to understand chunking strategies and their role in the process of retrieving and generating text. Indeed, each chunking strategy enhances RAG's effectiveness in its unique way.\n",
+"The goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\n",
+"Let's explore some chunking strategies together.\n",
+"The methods mentioned in the article you're about to read usually make use of two key parameters. First, we have [chunk_size]— which controls the size of your text chunks. Then there's [chunk_overlap], which takes care of how much text overlaps between one chunk and the next.\n",
+"\n",
+"=== 4 ===\n",
+"Semantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome.\n",
+"Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\n",
+"By focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's a top-notch choice when maintaining the semantic integrity of the text is vital.\n",
+"However, this method does require more effort and is notably slower than the previous ones.\n",
+"On our example text, since it is quite short and does not expose varied subjects, this method would only generate a single chunk.\n",
+"\n",
+"=== 5 ===\n",
+"Language models used in the rest of your possible RAG pipeline have a token limit, which should not be exceeded. When dividing your text into chunks, it's advisable to count the number of tokens. Plenty of tokenizers are available. To ensure accuracy, use the same tokenizer for counting tokens as the one used in the language model.\n",
+"Consequently, there are also splitters available for this purpose.\n",
+"For instance, by using the [SpacyTextSplitter] from LangChain, the following chunks are created:\n",
+"\n",
+"\n",
+"=== 6 ===\n",
+"First things first, we have Character Chunking. This strategy divides the text into chunks based on a fixed number of characters. Its simplicity makes it a great starting point, but it can sometimes disrupt the text's flow, breaking sentences or words in unexpected places. Despite its limitations, it's a great stepping stone towards more advanced methods.\n",
+"Now let’s see that in action with an example. Imagine a text that reads:\n",
+"If we decide to set our chunk size to 100 and no chunk overlap, we'd end up with the following chunks. As you can see, Character Chunking can lead to some intriguing, albeit sometimes nonsensical, results, cutting some of the sentences in their middle.\n",
+"By choosing a smaller chunk size, we would obtain more chunks, and by setting a bigger chunk overlap, we could obtain something like this:\n",
+"\n",
+"Also, by default this method creates chunks character by character based on the empty character [’ ’]. But you can specify a different one in order to chunk on something else, even a complete word! For instance, by specifying [' '] as the separator, you can avoid cutting words in their middle.\n",
+"\n",
+"=== 7 ===\n",
+"Next, let's take a look at Recursive Character Chunking. Based on the basic concept of Character Chunking, this advanced version takes it up a notch by dividing the text into chunks until a certain condition is met, such as reaching a minimum chunk size. This method ensures that the chunking process aligns with the text's structure, preserving more meaning. Its adaptability makes Recursive Character Chunking great for texts with varied structures.\n",
+"Again, let’s use the same example in order to illustrate this method. With a chunk size of 100, and the default settings for the other parameters, we obtain the following chunks:\n",
+"\n"
 ]
 }
 ],
 "source": [
-"points = client.query(COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10)\n",
-"\n",
-"print(\"<=== Retrieved documents ===>\")\n",
-"for point in points:\n",
-"    print(point.document)"
+"for i, point in enumerate(points):\n",
+"    print(f\"=== {i} ===\")\n",
+"    print(point.document)\n",
+"    print()"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
@@ -280,7 +291,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.13.0"
+"version": "3.12.7"
 }
 },
 "nbformat": 4,
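Pieced together, the updated Qdrant notebook reduces to a short indexing-and-retrieval flow. A condensed sketch of the cells shown in this diff, assuming `qdrant-client`, `docling`, and `fastembed` are installed; the in-memory client location and the collection name are illustrative assumptions, as their setup cells are not part of the diff:

```python
from qdrant_client import QdrantClient

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

COLLECTION_NAME = "docling"  # assumed name; not shown in the diff
client = QdrantClient(location=":memory:")  # assumed setup; FastEmbed-backed

# chunk the converted document with the hybrid chunker (the commit's change
# away from HierarchicalChunker)
result = DocumentConverter().convert(
    "https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag"
)
documents, metadatas = [], []
for chunk in HybridChunker().chunk(result.document):
    documents.append(chunk.text)
    metadatas.append(chunk.meta.export_json_dict())

# index and query, matching the updated cells above
_ = client.add(
    collection_name=COLLECTION_NAME,
    documents=documents,
    metadata=metadatas,
    batch_size=64,
)
points = client.query(
    collection_name=COLLECTION_NAME,
    query_text="Can I split documents?",
    limit=10,
)
for i, point in enumerate(points):
    print(f"=== {i} ===")
    print(point.document)
```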
docs/index.md

@@ -31,6 +31,15 @@ Docling parses documents and exports them to the desired format with ease and sp
 * 📝 Metadata extraction, including title, authors, references & language
 * 🦜🔗 Native LangChain extension
 
+## Get started
+
+<div class="grid">
+<a href="concepts/" class="card"><b>Concepts</b><br />Learn Docling fundamendals</a>
+<a href="examples/" class="card"><b>Examples</b><br />Try out recipes for various use cases, including conversion, RAG, and more</a>
+<a href="integrations/" class="card"><b>Integrations</b><br />Check out integrations with popular frameworks and tools</a>
+<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
+</div>
+
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
docs/integrations/crewai.md (new file)

@@ -0,0 +1,10 @@
+Docling is available in [CrewAI](https://www.crewai.com/) as the `CrewDoclingSource`
+knowledge source.
+
+- 💻 [Crew AI GitHub][github]
+- 📖 [Crew AI knowledge docs][docs]
+- 📦 [Crew AI PyPI][package]
+
+[github]: https://github.com/crewAIInc/crewAI/
+[docs]: https://docs.crewai.com/concepts/knowledge
+[package]: https://pypi.org/project/crewai/
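For a taste of what the `CrewDoclingSource` knowledge source enables, here is a hedged sketch based on the CrewAI knowledge docs linked above; the import path, the `Agent`/`Task` fields, and the example URL are assumptions to verify against those docs:

```python
from crewai import Agent, Crew, Task  # assumes `crewai` is installed
from crewai.knowledge.source.crew_docling_source import CrewDoclingSource

# Docling parses these sources into a knowledge base for the crew
source = CrewDoclingSource(
    file_paths=["https://arxiv.org/pdf/2408.09869"],  # illustrative URL
)

agent = Agent(
    role="Research assistant",
    goal="Answer questions grounded in the provided documents",
    backstory="You only answer from the knowledge sources.",
)
task = Task(
    description="What does the report say about document conversion?",
    expected_output="A short, grounded answer.",
    agent=agent,
)
crew = Crew(agents=[agent], tasks=[task], knowledge_sources=[source])
print(crew.kickoff())
```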
docs/integrations/nvidia.md (new file)

@@ -0,0 +1,6 @@
+Docling is powering the NVIDIA *PDF to Podcast* agentic AI blueprint:
+
+- [🏠 PDF to Podcast home](https://build.nvidia.com/nvidia/pdf-to-podcast)
+- [💻 PDF to Podcast GitHub](https://github.com/NVIDIA-AI-Blueprints/pdf-to-podcast)
+- [📣 PDF to Podcast announcement](https://nvidianews.nvidia.com/news/nvidia-launches-ai-foundation-models-for-rtx-ai-pcs)
+- [✍️ PDF to Podcast blog post](https://blogs.nvidia.com/blog/agentic-ai-blueprints/)
docs/integrations/opencontracts.md

@@ -1,10 +1,5 @@
 Docling is available an ingestion engine for [OpenContracts](https://github.com/JSv4/OpenContracts), allowing you to use Docling's OCR engine(s), chunker(s), labels, etc. and load them into a platform supporting bulk data extraction, text annotating, and question-answering:
 
-- 💻 [GitHub](https://github.com/JSv4/OpenContracts)
-- 📖 [Docs](https://jsv4.github.io/OpenContracts/)]
-
-#### Docling in Action!
-
-![Docling in action](https://github.com/JSv4/OpenContracts/blob/main/docs/assets/images/gifs/PDF%20Annotation%20Flow.gif)
-
-
+- 💻 [OpenContracts GitHub](https://github.com/JSv4/OpenContracts)
+- 📖 [OpenContracts Docs](https://jsv4.github.io/OpenContracts/)
+- ▶️ [OpenContracts x Docling PDF annotation screen capture](https://github.com/JSv4/OpenContracts/blob/main/docs/assets/images/gifs/PDF%20Annotation%20Flow.gif)
mkdocs.yml

@@ -65,7 +65,7 @@ nav:
     - Chunking: concepts/chunking.md
   - Examples:
     - Examples: examples/index.md
-    - Conversion:
+    - 🔀 Conversion:
       - "Simple conversion": examples/minimal.py
       - "Custom conversion": examples/custom_convert.py
       - "Batch conversion": examples/batch_convert.py
@@ -76,31 +76,36 @@ nav:
       - "Multimodal export": examples/export_multimodal.py
       - "Force full page OCR": examples/full_page_ocr.py
       - "Accelerator options": examples/run_with_accelerator.py
-    - Chunking:
+    - ✂️ Chunking:
       - "Hybrid chunking": examples/hybrid_chunking.ipynb
-    - RAG / QA:
-      - "RAG with Haystack": examples/rag_haystack.ipynb
-      - "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
-      - "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
-      - "RAG with Weaviate": examples/rag_weaviate.ipynb
-      - "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
+    - 💬 RAG / QA:
+      - examples/rag_haystack.ipynb
+      - examples/rag_llamaindex.ipynb
+      - examples/rag_langchain.ipynb
+      - examples/rag_weaviate.ipynb
+      - examples/retrieval_qdrant.ipynb
   - Integrations:
     - Integrations: integrations/index.md
-    - "🐝 Bee": integrations/bee.md
-    - "Cloudera": integrations/cloudera.md
-    - "Data Prep Kit": integrations/data_prep_kit.md
-    - "DocETL": integrations/docetl.md
+    - 🤖 Agentic / AI dev frameworks:
+      - "Bee Agent Framework": integrations/bee.md
+      - "Crew AI": integrations/crewai.md
     - "Haystack": integrations/haystack.md
-    - "🐶 InstructLab": integrations/instructlab.md
-    - "Kotaemon": integrations/kotaemon.md
-    - "🦙 LlamaIndex": integrations/llamaindex.md
-    - "OpenContracts": integrations/opencontracts.md
+#     - "LangChain": integrations/langchain.md
+      - "LlamaIndex": integrations/llamaindex.md
+      - "txtai": integrations/txtai.md
+    - ⭐️ Featured:
+      - "Data Prep Kit": integrations/data_prep_kit.md
+      - "InstructLab": integrations/instructlab.md
+      - "NVIDIA": integrations/nvidia.md
     - "Prodigy": integrations/prodigy.md
     - "RHEL AI": integrations/rhel_ai.md
     - "spaCy": integrations/spacy.md
-    - "txtai": integrations/txtai.md
+    - 🗂️ More integrations:
+      - "Cloudera": integrations/cloudera.md
+      - "DocETL": integrations/docetl.md
+      - "Kotaemon": integrations/kotaemon.md
+      - "OpenContracts": integrations/opencontracts.md
     - "Vectara": integrations/vectara.md
-#   - "LangChain 🦜🔗": integrations/langchain.md
   - Reference:
     - Python API:
       - Document Converter: reference/document_converter.md