From 2baa35c548dd6d15dba449eb1dc707f8f08c0a2a Mon Sep 17 00:00:00 2001
From: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Date: Wed, 17 Jul 2024 15:49:26 +0200
Subject: [PATCH] docs: reflect supported Python versions, add badges (#10)

* docs: reflect supported Python versions, add badges

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor HTML fix

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---
 README.md | 55 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index f70c015..847ede6 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,18 @@

- Docling + Docling

 # Docling
 
+[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+[![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
 Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
 
 ## Features
@@ -12,25 +21,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
 * 📝 Extracts metadata from the document, such as title, authors, references and language
 * 🔍 Optionally applies OCR (use with scanned PDFs)
 
-## Setup
+## Installation
 
-For general usage, you can simply install `docling` through `pip` from the pypi package index.
-```
+To use Docling, simply install `docling` from your package manager, e.g. pip:
+```bash
 pip install docling
 ```
 
-**Notes**:
-* Works on macOS and Linux environments. Windows platforms are currently not tested.
+> [!NOTE]
+> Works on macOS and Linux environments. Windows platforms are currently not tested.
 
 ### Development setup
 
-To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
-Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
+To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
 ```bash
-poetry env use $(which python3.11)
-poetry shell
 poetry install
 ```
 
@@ -45,23 +49,24 @@ The output of the above command will be written to `./scratch`.
 
 ### Enable or disable pipeline features
 
-You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
 ```python
 doc_converter = DocumentConverter(
     artifacts_path=artifacts_path,
-    pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-                                     do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+    pipeline_options=PipelineOptions(
+        do_table_structure=False, # controls if table structure is recovered
+        do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+    ),
 )
 ```
 
 ### Impose limits on the document size
 
-You can limit the file size and number of pages which should be allowed to process per document.
+You can limit the file size and number of pages which should be allowed to process per document:
 ```python
-paths = [Path("./test/data/2206.01062.pdf")]
-
-input = DocumentConversionInput.from_paths(
-    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+conv_input = DocumentConversionInput.from_paths(
+    paths=[Path("./test/data/2206.01062.pdf")],
+    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
 )
 ```
 
@@ -71,12 +76,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
 ```python
 buf = BytesIO(your_binary_stream)
 docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
-input = DocumentConversionInput.from_streams(docs)
-converted_docs = doc_converter.convert(input)
+conv_input = DocumentConversionInput.from_streams(docs)
+converted_docs = doc_converter.convert(conv_input)
 ```
 ### Limit resource usage
 
-You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
 ## Contributing
 
@@ -86,7 +91,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main
 
 ## References
 
-If you use `Docling` in your projects, please consider citing the following:
+If you use Docling in your projects, please consider citing the following:
 
 ```bib
 @software{Docling,
@@ -101,5 +106,5 @@ year = {2024}
 
 ## License
 
-The `Docling` codebase is under MIT license.
+The Docling codebase is under MIT license.
 For individual model usage, please refer to the model licenses found in the original packages.
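
Note on the README sections touched by this patch: the `max_file_size=20971520` literal and the `OMP_NUM_THREADS` setting may be easier to follow with a small illustration. The following is a minimal sketch using only the Python standard library. The helpers `mib_to_bytes` and `set_thread_limit` are hypothetical names introduced here, not part of Docling, and the sketch assumes the environment variable is set before `docling` is imported so that it is in place when the underlying thread pool is initialized.

```python
import os

MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(n_mib: int) -> int:
    """Convert mebibytes to bytes (hypothetical helper, not part of Docling)."""
    return n_mib * MIB


def set_thread_limit(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for the current process (hypothetical helper)."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


# Assumption: this runs before `docling` is imported.
set_thread_limit(2)

# 20 MiB expressed in bytes; equal to the 20971520 literal in the README example.
max_file_size = mib_to_bytes(20)
```

Under these assumptions, `max_file_size` could be passed to `DocumentLimits` in place of the raw literal, which makes the intended 20 MiB limit explicit.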