docs: reflect supported Python versions, add badges (#10)
* docs: reflect supported Python versions, add badges

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor HTML fix

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
parent 0dfa4548d3
commit 2baa35c548
55
README.md
@@ -1,9 +1,18 @@
 <p align="center">
-<a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+<a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
 </p>
 
 # Docling
 
+[](https://pypi.org/project/docling/)
+
+[](https://python-poetry.org/)
+[](https://github.com/psf/black)
+[](https://pycqa.github.io/isort/)
+[](https://pydantic.dev)
+[](https://github.com/pre-commit/pre-commit)
+[](https://opensource.org/licenses/MIT)
+
 Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
 
 ## Features
@@ -12,25 +21,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
 * 📝 Extracts metadata from the document, such as title, authors, references and language
 * 🔍 Optionally applies OCR (use with scanned PDFs)
 
-## Setup
+## Installation
 
-For general usage, you can simply install `docling` through `pip` from the pypi package index.
-```
+To use Docling, simply install `docling` from your package manager, e.g. pip:
+```bash
 pip install docling
 ```
 
-**Notes**:
-* Works on macOS and Linux environments. Windows platforms are currently not tested.
+> [!NOTE]
+> Works on macOS and Linux environments. Windows platforms are currently not tested.
 
 ### Development setup
 
-To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
-Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
+To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
 ```bash
-poetry env use $(which python3.11)
-poetry shell
 poetry install
 ```
 
@@ -45,23 +49,24 @@ The output of the above command will be written to `./scratch`.
 
 ### Enable or disable pipeline features
 
-You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
 ```python
 doc_converter = DocumentConverter(
     artifacts_path=artifacts_path,
-    pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-        do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+    pipeline_options=PipelineOptions(
+        do_table_structure=False, # controls if table structure is recovered
+        do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+    ),
 )
 ```
 
 ### Impose limits on the document size
 
-You can limit the file size and number of pages which should be allowed to process per document.
+You can limit the file size and number of pages which should be allowed to process per document:
 ```python
-paths = [Path("./test/data/2206.01062.pdf")]
-
-input = DocumentConversionInput.from_paths(
-    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+conv_input = DocumentConversionInput.from_paths(
+    paths=[Path("./test/data/2206.01062.pdf")],
+    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
 )
 ```
 
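Note on the limits in the hunk above: `max_file_size=20971520` reads as a byte count, since it equals 20 × 1024 × 1024 (20 MiB). Below is a minimal sketch of the same two checks using only the standard library, for example to filter files before handing them to Docling. `within_limits` and its parameters are hypothetical names, not part of Docling's API, and the page count is assumed to be supplied by the caller.

```python
import os
import tempfile

MAX_NUM_PAGES = 100
MAX_FILE_SIZE = 20 * 1024 * 1024  # 20971520, the value used in the README snippet


def within_limits(path, num_pages,
                  max_num_pages=MAX_NUM_PAGES, max_file_size=MAX_FILE_SIZE):
    """Hypothetical pre-check: True if the file is within both limits."""
    return os.path.getsize(path) <= max_file_size and num_pages <= max_num_pages


# Demo on a throwaway 1 KiB file (stands in for a real PDF).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
demo_path = tmp.name

small_ok = within_limits(demo_path, num_pages=10)         # within both limits
too_many_pages = within_limits(demo_path, num_pages=101)  # exceeds the page limit
os.remove(demo_path)
```

Writing the size as `20 * 1024 * 1024` rather than the literal keeps the unit visible to readers.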
@@ -71,12 +76,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
 ```python
 buf = BytesIO(your_binary_stream)
 docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
-input = DocumentConversionInput.from_streams(docs)
-converted_docs = doc_converter.convert(input)
+conv_input = DocumentConversionInput.from_streams(docs)
+converted_docs = doc_converter.convert(conv_input)
 ```
 ### Limit resource usage
 
-You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
 
 ## Contributing
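Note on the `OMP_NUM_THREADS` paragraph in the hunk above: OpenMP-based libraries generally read this variable when they initialize, so it is usually set before the process starts or before the heavy imports run. The sketch below assumes that behaviour also holds for Docling's dependencies. `resolve_thread_count` is a hypothetical helper, not a Docling function, and its fallback of 4 mirrors the default stated in the README.

```python
import os

DEFAULT_THREADS = 4  # default stated in the README


def resolve_thread_count(environ=os.environ):
    """Hypothetical helper: thread count implied by OMP_NUM_THREADS, else the default."""
    raw = environ.get("OMP_NUM_THREADS", "")
    # Fall back to the default for missing, non-numeric, or non-positive values.
    return int(raw) if raw.isdigit() and int(raw) > 0 else DEFAULT_THREADS


# Set the limit first; the docling import would follow (omitted here on purpose).
os.environ["OMP_NUM_THREADS"] = "2"
threads = resolve_thread_count()
```

From a shell, the variable can be set for a single run with `OMP_NUM_THREADS=2 python your_script.py`, where `your_script.py` is a placeholder name.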
@@ -86,7 +91,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main
 
 ## References
 
-If you use `Docling` in your projects, please consider citing the following:
+If you use Docling in your projects, please consider citing the following:
 
 ```bib
 @software{Docling,
@@ -101,5 +106,5 @@ year = {2024}
 
 ## License
 
-The `Docling` codebase is under MIT license.
+The Docling codebase is under MIT license.
 For individual model usage, please refer to the model licenses found in the original packages.