Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Go to file
Michele Dolfi e45dc5d1a5
ci: Add Github Actions (#4)
* add Github Actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply styling

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update .github/actions/setup-poetry/action.yml

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* add semantic-release config

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-07-16 13:05:04 +02:00
.github ci: Add Github Actions (#4) 2024-07-16 13:05:04 +02:00
docling ci: Add Github Actions (#4) 2024-07-16 13:05:04 +02:00
examples Update convert.py (#3) 2024-07-15 18:02:42 +02:00
test Initial commit 2024-07-15 09:42:42 +02:00
.gitignore ci: Add Github Actions (#4) 2024-07-16 13:05:04 +02:00
.pre-commit-config.yaml Initial commit 2024-07-15 09:42:42 +02:00
CODE_OF_CONDUCT.md Initial commit 2024-07-15 09:42:42 +02:00
CONTRIBUTING.md Initial commit 2024-07-15 09:42:42 +02:00
Dockerfile doc: More documentation updates (#2) 2024-07-15 14:59:53 +02:00
LICENSE Initial commit 2024-07-15 09:42:42 +02:00
logo.png Initial commit 2024-07-15 09:42:42 +02:00
MAINTAINERS.md Initial commit 2024-07-15 09:42:42 +02:00
poetry.lock Initial commit 2024-07-15 09:42:42 +02:00
pyproject.toml ci: Add Github Actions (#4) 2024-07-16 13:05:04 +02:00
README.md doc: More documentation updates (#2) 2024-07-15 14:59:53 +02:00

Docling

Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

  • Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Optionally applies OCR (use with scanned PDFs)

Setup

You need Python 3.11 and poetry. Install poetry from here.

Once you have poetry installed, create an environment and install the package:

poetry env use $(which python3.11)
poetry shell
poetry install

Notes:

  • Works on macOS and Linux environments. Windows platforms are currently not tested.

Usage

For basic usage, see the convert.py example module. Run with:

python examples/convert.py

The output of the above command will be written to ./scratch.

Enable or disable pipeline features

You can control if table structure recognition or OCR should be performed by arguments passed to DocumentConverter

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered. 
                                     do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
)

Impose limits on the document size

You can limit the file size and number of pages which should be allowed to process per document.

paths = [Path("./test/data/2206.01062.pdf")]

input = DocumentConversionInput.from_paths(
    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)

Limit resource usage

You can limit the CPU threads used by docling by setting the environment variable OMP_NUM_THREADS accordingly. The default setting is using 4 CPU threads.

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
author = {Deep Search Team},
month = {7},
title = {{Docling}},
url = {https://github.com/DS4SD/docling},
version = {main},
year = {2024}
}

License

The Docling codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.