Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Václav Vančura 772487f9c9
feat(actor): Docling Actor on Apify infrastructure (#875)
* fix: Improve OCR results, stricten criteria before dropping bitmap areas  (#719)

fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* chore: bump version to 2.15.1 [skip ci]

* Actor: Initial implementation

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: .dockerignore update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Actor badge

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Moving the badge where it belongs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Switching Docker to python:3.11-slim-bookworm

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Docker security with proper user permissions

- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Optimize Dockerfile with security and size improvements

- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add Docker image metadata labels

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update dependencies with fixed versions

Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix apify-cli version problem

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Create Apify user home directory in Docker setup

Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update Docker configuration for improved security

- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve shell script robustness and error handling

The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve script logging and error handling

- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updating Docling to 2.17.0

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Dockerfile with additional utilities and env vars

- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Apify FirstPromoter integration

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the "Run on Apify" button

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixing example PDF document URLs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding input document URL validation

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix quoting in `DOC_CONVERT_CMD` variable

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add specific error codes for better error handling

- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance error handling and data logging

- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Normalize key-value store terminology

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance `README.md` with output details

Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding CHANGELOG.md

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding dataset schema

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update README with output URL details

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix the Apify call syntax and final result URL message

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add section on Actors to README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Replace Docling CLI with docling-serve API

This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Overhaul the implementation using official docling-serve image

This commit completely revamps the Actor implementation with two major improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing

2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution

3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available

4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Refactor `actor.sh` and add `docling_processor.py`

Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update CHANGELOG and README for Docker and API changes

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Removing obsolete actor.json keys

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixed input getter

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Always output a zip

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Resolving conflicts with main

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Resolving conflicts with main (pass 2)

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updated main Readme and Actor Readme

Signed-off-by: Adam Kliment <adam@netmilk.net>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
Signed-off-by: Václav Vančura <commit@vancura.dev>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adam Kliment <adam@netmilk.net>
2025-03-18 10:17:44 +01:00
.actor feat(actor): Docling Actor on Apify infrastructure (#875) 2025-03-18 10:17:44 +01:00
.github chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
docling chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
docs docs: fix spelling of picture in usage (#1165) 2025-03-17 09:33:51 +01:00
tests chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
.gitignore ci: Add Github Actions (#4) 2024-07-16 13:05:04 +02:00
.pre-commit-config.yaml feat!: Docling v2 (#117) 2024-10-16 21:02:03 +02:00
CHANGELOG.md chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
CITATION.cff chore: add downloads in README, security policy and update ci actions (#401) 2024-11-21 13:59:45 +01:00
CODE_OF_CONDUCT.md Initial commit 2024-07-15 09:42:42 +02:00
CONTRIBUTING.md chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
Dockerfile docs: update example Dockerfile with download CLI (#929) 2025-02-13 14:19:50 +01:00
LICENSE chore: fix placeholders in license (#63) 2024-09-06 17:10:07 +02:00
MAINTAINERS.md docs: Update MAINTAINERS.md (#59) 2024-09-02 12:34:38 +02:00
mkdocs.yml chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
poetry.lock feat: equations to latex in MSWord backend (with inline groups) (#1114) 2025-03-13 15:12:22 +01:00
pyproject.toml chore: move to docling-project org (#1160) 2025-03-14 12:35:29 +01:00
README.md feat(actor): Docling Actor on Apify infrastructure (#875) 2025-03-18 10:17:44 +01:00

Docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Features

  • 🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
  • 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
  • 🧬 Unified, expressive DoclingDocument representation format
  • ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
  • 🔍 Extensive OCR support for scanned PDFs and images
  • 💻 Simple and convenient CLI

Coming soon

  • 📝 Metadata extraction, including title, authors, references & language
  • 📝 Inclusion of Visual Language Models (SmolDocling)
  • 📝 Chart understanding (bar charts, pie charts, line plots, etc.)
  • 📝 Complex chemistry understanding (Molecular structures)

Installation

To use Docling, install the docling package with your package manager, e.g. pip:

pip install docling

Docling works on macOS, Linux, and Windows, on both x86_64 and arm64 architectures.

More detailed installation instructions are available in the docs.

Getting started

To convert individual documents, use convert(), for example:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
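
When several documents need converting, the same convert() call can be wrapped in a small loop. The sketch below is one possible way to do that and is not part of Docling itself: the helper names markdown_name and convert_sources and the "out" directory are made up for illustration, and only DocumentConverter, convert(), and export_to_markdown() are taken from the example above.

```python
from pathlib import Path
from urllib.parse import urlparse


def markdown_name(source: str) -> str:
    """Derive an output file name from a POSIX-style local path or a URL."""
    # urlparse leaves plain local paths in .path, so both kinds of source work.
    name = Path(urlparse(source).path).name or "document"
    # Append .md instead of swapping the suffix, so dotted names such as
    # arXiv IDs ("2408.09869") are kept intact.
    return f"{name}.md"


def convert_sources(sources: list[str], out_dir: str = "out") -> list[Path]:
    """Convert each source with Docling and write one Markdown file per source."""
    # Imported here so markdown_name() stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()  # one converter reused for all sources
    written = []
    for source in sources:
        result = converter.convert(source)
        path = target / markdown_name(source)
        path.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(path)
    return written
```

With Docling installed, calling convert_sources(["https://arxiv.org/pdf/2408.09869"]) would write the Markdown export to out/2408.09869.md.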

More advanced usage options are available in the docs.

Documentation

Check out Docling's documentation for details on installation, usage, concepts, recipes, extensions, and more.

Examples

Go hands-on with our examples, demonstrating how to address different application use cases with Docling.

Integrations

To further accelerate your AI application development, check out Docling's native integrations with popular frameworks and tools.

Apify Actor

You can run Docling in the cloud, without installing anything, by using the Docling Actor on the Apify platform. Provide one or more document URLs and get the processed results:

apify call vancura/docling -i '{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
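
If the input is assembled programmatically, it may help to build and check it before passing it to the CLI. The following is a minimal sketch that mirrors the shape of the example call above; the function name build_actor_input and the validation rules are assumptions for illustration, and the list of formats is copied from the example rather than from an authoritative schema.

```python
import json
from urllib.parse import urlparse

# Formats taken from the example call above; the Actor may accept others.
KNOWN_FORMATS = ("md", "json", "html", "text", "doctags")


def build_actor_input(urls, to_formats=("md",)):
    """Return the input object used in the `apify call` example as a dict."""
    for fmt in to_formats:
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"unknown output format: {fmt!r}")
    for url in urls:
        parts = urlparse(url)
        # Only absolute http(s) URLs make sense as remote document sources.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
    return {
        "options": {"to_formats": list(to_formats)},
        "http_sources": [{"url": url} for url in urls],
    }


payload = build_actor_input(["https://arxiv.org/pdf/2408.09869"], ["md", "json"])
print(json.dumps(payload, indent=2))
```

The printed JSON can then be passed as the argument of the -i option shown in the call above.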

The Actor stores results in:

  • Processed document in key-value store (OUTPUT_RESULT)
  • Processing logs (DOCLING_LOG)
  • Dataset record with result URL and status
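
To fetch a stored record after a run, one option is the Apify API's key-value store record endpoint. The sketch below only builds the record URL; it assumes the record keys listed above and the public api.apify.com/v2 URL layout, the function name record_url is hypothetical, and the store ID is a placeholder that has to come from the actual run.

```python
from urllib.parse import quote

APIFY_API = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the URL of one record in an Apify key-value store."""
    # Percent-encode both parts so unusual characters cannot break the path.
    store = quote(store_id, safe="")
    record = quote(key, safe="")
    return f"{APIFY_API}/key-value-stores/{store}/records/{record}"


# "STORE_ID" is a placeholder, not a real store.
print(record_url("STORE_ID", "OUTPUT_RESULT"))
print(record_url("STORE_ID", "DOCLING_LOG"))
```

Stores that are not public may additionally require an API token when the URL is requested.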

Read more about the Docling Actor, including how to use it via the Apify API and CLI.

Get help and support

Please feel free to connect with us using the discussion section.

Technical report

For more details on Docling's inner workings, check out the Docling Technical Report.

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

IBM ❤️ Open Source AI

Docling has been brought to you by IBM.