NeoAnd/Docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
📑 Understands detailed page layout, reading order and recovers table structures
📝 Extracts metadata from the document, such as title, authors, references and language
🔍 Optionally applies OCR (useful for scanned PDFs)

Setup

You need Python 3.11 and poetry. Install poetry by following its official installation instructions.

Once you have poetry installed, create an environment and install the package:

poetry env use $(which python3.11)
poetry shell
poetry install

Notes:

Works on macOS and Linux environments. Windows platforms are currently not tested.

Usage

For basic usage, see the convert.py example module. Run with:

python examples/convert.py

The output of the above command will be written to ./scratch.

Enable or disable pipeline features

You can control whether table structure recognition or OCR is performed through arguments passed to DocumentConverter:

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # Controls if table structure is recovered.
        do_ocr=True,  # Controls if OCR is applied (ignores programmatic content).
    ),
)

Impose limits on the document size

You can limit the file size and the number of pages allowed per document:

from pathlib import Path

paths = [Path("./test/data/2206.01062.pdf")]

# 20971520 = 20 * 1024 * 1024
input = DocumentConversionInput.from_paths(
    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)
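The `max_file_size` value above equals 20 × 1024 × 1024, which suggests the limit is given in bytes (20 MiB). That unit is an inference from the number, not something the README states. Under that assumption, a small helper (the name `mib_to_bytes` is hypothetical and not part of docling) can make such limits easier to read:

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    if mib < 0:
        raise ValueError("size must be non-negative")
    return mib * 1024 * 1024


# Assumption: max_file_size is expressed in bytes.
max_file_size = mib_to_bytes(20)
print(max_file_size)  # 20971520
```

The result can then be passed as `max_file_size` in place of the literal.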

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

from io import BytesIO

buf = BytesIO(your_binary_stream)  # your_binary_stream: the raw PDF bytes
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)
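One way to obtain the buffer used above is to read the bytes yourself and do a quick sanity check before handing them to the converter. The following is a minimal sketch using only the standard library. The helper name `read_pdf_buffer` is hypothetical, and the header check is an illustrative precaution, not something docling requires:

```python
from io import BytesIO
from pathlib import Path


def read_pdf_buffer(path: Path) -> BytesIO:
    """Read a file into memory and return it as a BytesIO buffer.

    Raises ValueError if the data does not start with the "%PDF-" marker
    that PDF files normally begin with. This is a strict check; some
    real-world PDFs carry leading bytes before the marker.
    """
    data = path.read_bytes()
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{path} does not look like a PDF file")
    return BytesIO(data)
```

The returned buffer can be used as the `stream` argument in the snippet above, for example `read_pdf_buffer(Path("./test/data/2206.01062.pdf"))`.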

Limit resource usage

You can limit the number of CPU threads used by docling by setting the environment variable OMP_NUM_THREADS accordingly. By default, docling uses 4 CPU threads.
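The variable can be set in the shell before launching Python, or from within a script. The sketch below shows the second option. It assumes the variable must be set before docling and its numeric dependencies are imported, since OpenMP runtimes typically read it once at start-up. The helper name `limit_threads` is hypothetical:

```python
import os


def limit_threads(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for the current process."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


# Assumption: call this before importing docling so the setting takes effect.
limit_threads(2)
```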

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
author = {Deep Search Team},
month = {7},
title = {{Docling}},
url = {https://github.com/DS4SD/docling},
version = {main},
year = {2024}
}

License

The Docling codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.