Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

  • Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Optionally applies OCR (use with scanned PDFs)

Installation

To use Docling, simply install docling from your package manager, e.g. pip:

pip install docling

Note

Works on macOS and Linux environments. Windows platforms are currently not tested.

Development setup

To develop for Docling, you need Python 3.11 or 3.12 and Poetry. You can then install the project from the root directory of your local clone:

poetry install

Usage

For basic usage, see the convert.py example module. Run with:

python examples/convert.py

The output of the above command will be written to ./scratch.

Enable or disable pipeline features

You can control whether table structure recognition or OCR is performed through the arguments passed to DocumentConverter:

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # controls if table structure is recovered
        do_ocr=True,  # controls if OCR is applied (ignores programmatic content)
    ),
)

Impose limits on the document size

You can limit the file size and the number of pages that Docling will process per document:

conv_input = DocumentConversionInput.from_paths(
    paths=[Path("./test/data/2206.01062.pdf")],
    # 20971520 = 20 * 1024 * 1024
    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520),
)
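
As a rough sketch of where a value such as 20971520 comes from, the snippet below derives a size limit from a mebibyte count and pre-filters files by size using only the standard library. The helper names `mib_to_bytes` and `within_size_limit` are hypothetical and not part of Docling, and the sketch assumes the limit is expressed in bytes; it does not show how Docling itself enforces `DocumentLimits`.

```python
from pathlib import Path

MIB = 1024 * 1024  # number of bytes in one mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in mebibytes to bytes (hypothetical helper)."""
    return mib * MIB


def within_size_limit(path: Path, max_file_size: int) -> bool:
    """Return True if the file at `path` is at most `max_file_size` bytes.

    This is only a pre-filter for illustration; whether Docling treats
    the boundary as inclusive is an assumption.
    """
    return path.stat().st_size <= max_file_size


# 20 MiB, the same number used in the limits example above
max_file_size = mib_to_bytes(20)
```

Filtering a list of candidate paths with `within_size_limit` before building the conversion input is one way to skip oversized files early.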

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
conv_input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(conv_input)
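
If the bytes come from a file or a network response, a small standard-library sketch like the following can prepare the `BytesIO` buffer used above. The helpers `load_pdf_bytes` and `looks_like_pdf` are hypothetical names, not Docling functions, and the `%PDF-` header check is a quick sanity test that assumes the header sits at the very start of the data; it is not a full validation.

```python
from io import BytesIO
from pathlib import Path


def load_pdf_bytes(path: Path) -> BytesIO:
    """Read a file fully into memory and return a stream positioned at 0."""
    return BytesIO(path.read_bytes())


def looks_like_pdf(buf: BytesIO) -> bool:
    """Check for the '%PDF-' header without consuming the stream."""
    pos = buf.tell()       # remember the current position
    header = buf.read(5)   # the header is five bytes long
    buf.seek(pos)          # restore the position for the next reader
    return header == b"%PDF-"
```

Restoring the stream position matters because a consumer that receives the buffer afterwards will usually expect to read from the beginning.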

Limit resource usage

You can limit the number of CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS accordingly. By default, Docling uses 4 CPU threads.
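
A minimal sketch of setting the variable from within Python rather than from the shell is shown below. The helper `set_thread_limit` is a hypothetical name. The sketch assumes that the variable is read when the OpenMP-backed libraries are first loaded, so it should run before Docling is imported.

```python
import os


def set_thread_limit(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for the current process (hypothetical helper)."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    # Environment variables are strings, so convert the count explicitly.
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


# Assumption: call this before importing docling so the setting is picked up.
set_thread_limit(2)
```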

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
author = {Deep Search Team},
month = {7},
title = {{Docling}},
url = {https://github.com/DS4SD/docling},
version = {main},
year = {2024}
}

License

The Docling codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.