From 28d1c746a6e0ce033861ecb9e8844ff6c9194f83 Mon Sep 17 00:00:00 2001
From: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Date: Thu, 18 Jul 2024 11:23:23 +0200
Subject: [PATCH] chore: update README (#13)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---
 README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 470936a..0e80b3b 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@

- Docling + + Docling +

 # Docling
@@ -11,7 +13,7 @@
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
-[![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)

 Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

@@ -49,7 +51,7 @@ The output of the above command will be written to `./scratch`.

 ### Adjust pipeline features

-**Control pipeline options**
+#### Control pipeline options

 You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
 ```python
@@ -62,16 +64,15 @@ doc_converter = DocumentConverter(
 )
 ```

-**Control table extraction options**
+#### Control table extraction options

 You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
 This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.


 ```python
-
 pipeline_options = PipelineOptions(do_table_structure=True)
-pipeline_options.table_structure_options.do_cell_matching = False # Uses text cells predicted from table structure model
+pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model

 doc_converter = DocumentConverter(
     artifacts_path=artifacts_path,