Docling
Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
Features
- ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
- 📑 Understands detailed page layout, reading order and recovers table structures
- 📝 Extracts metadata from the document, such as title, authors, references and language
- 🔍 Optionally applies OCR (use with scanned PDFs)
Setup
You need Python 3.11 and poetry. If you do not have poetry yet, follow its official installation instructions.
Once poetry is installed, create an environment and install the package:
poetry env use $(which python3.11)
poetry shell
poetry install
Notes:
- Works on macOS and Linux environments. Windows platforms are currently not tested.
Usage
For basic usage, see the convert.py example module. Run with:
python examples/convert.py
The output of the above command will be written to ./scratch.
Enable or disable pipeline features
You can control whether table structure recognition or OCR is performed through the arguments passed to DocumentConverter:
doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # Controls if table structure is recovered.
        do_ocr=True,  # Controls if OCR is applied (ignores programmatic content).
    ),
)
Impose limits on the document size
You can limit the file size and the number of pages allowed per document:
paths = [Path("./test/data/2206.01062.pdf")]
input = DocumentConversionInput.from_paths(
    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)
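The max_file_size value used above, 20971520, is 20 MiB expressed in bytes. The following is a minimal sketch, using only the Python standard library, of how that number is derived and how a file could be checked against it before it is handed to Docling. The names MAX_FILE_SIZE and within_size_limit are hypothetical and not part of the Docling API.

```python
from pathlib import Path

# 20 MiB in bytes; this equals the max_file_size value 20971520.
MAX_FILE_SIZE = 20 * 1024 * 1024


def within_size_limit(path: Path, max_file_size: int = MAX_FILE_SIZE) -> bool:
    """Return True if path is an existing file no larger than max_file_size bytes.

    Hypothetical helper for illustration only; Docling applies its own
    limits through DocumentLimits.
    """
    return path.is_file() and path.stat().st_size <= max_file_size
```

A pre-check like this only lets you skip oversized files early; the limits passed to DocumentLimits remain the ones Docling itself uses.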
Convert from binary PDF streams
You can convert PDFs from a binary stream instead of from the filesystem as follows:
buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)
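In the snippet above, your_binary_stream stands for the raw bytes of a PDF. As a rough sketch of how such a buffer might be prepared with the standard library alone, the helpers below read a file into a BytesIO object and check that it starts with the PDF signature. Both load_pdf_buffer and looks_like_pdf are hypothetical names introduced here for illustration; they are not part of Docling.

```python
from io import BytesIO
from pathlib import Path


def load_pdf_buffer(path: Path) -> BytesIO:
    """Read a file fully into memory and return it as a seekable binary buffer."""
    return BytesIO(path.read_bytes())


def looks_like_pdf(buf: BytesIO) -> bool:
    """Check for the '%PDF-' signature without changing the buffer position."""
    position = buf.tell()
    header = buf.read(5)
    buf.seek(position)  # restore the position so the buffer can still be read from the same place
    return header == b"%PDF-"
```

The resulting buffer can then take the place of buf in the DocumentStream call shown above.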
Limit resource usage
You can limit the number of CPU threads used by docling by setting the environment variable OMP_NUM_THREADS accordingly. The default is 4 CPU threads.
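If you prefer to set the variable from Python rather than from the shell, a minimal sketch could look like the following. It assumes the variable is read when the underlying threading runtime initializes, so it is set before docling is imported. The helper name limit_omp_threads is hypothetical.

```python
import os


def limit_omp_threads(num_threads: int) -> None:
    """Set OMP_NUM_THREADS for the current process (hypothetical helper).

    Assumption: this is called before importing docling, so the value is
    already in place when the threading runtime starts.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    # Environment variables are strings, so the number is converted explicitly.
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


limit_omp_threads(2)
# Import docling only after this point.
```

Setting the variable in the shell (for example, as a prefix to the python command) has the same effect and avoids any dependence on import order.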
Contributing
Please read Contributing to Docling (CONTRIBUTING.md) for details.
References
If you use Docling in your projects, please consider citing the following:
@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}
License
The Docling codebase is under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.