
* fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719) fix: Properly care for all bitmap elements in OCR Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Adam Kliment <adam@netmilk.net> * chore: bump version to 2.15.1 [skip ci] * Actor: Initial implementation Signed-off-by: Václav Vančura <commit@vancura.dev> Signed-off-by: Adam Kliment <adam@netmilk.net> * Actor: .dockerignore update Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding the Actor badge Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Moving the badge where it belongs Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Documentation update Signed-off-by: Václav Vančura <commit@vancura.dev> Signed-off-by: Adam Kliment <adam@netmilk.net> * Actor: Switching Docker to python:3.11-slim-bookworm Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Enhance Docker security with proper user permissions - Set proper ownership and permissions for runtime directory. - Switch to non-root user for enhanced security. - Use `--chown` flag in COPY commands to maintain correct file ownership. - Ensure all files and directories are owned by `appuser`. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Optimize Dockerfile with security and size improvements - Combine RUN commands to reduce image layers and overall size. - Add non-root user `appuser` for improved security. - Use `--no-install-recommends` flag to minimize installed packages. - Install only necessary dependencies in a single RUN command. - Maintain proper cleanup of package lists and caches. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Add Docker image metadata labels Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Update dependencies with fixed versions Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Fix apify-cli version problem Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Create Apify user home directory in Docker setup Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Update Docker configuration for improved security - Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning. - Improve readability with consistent formatting and spacing in RUN commands. - Enhance security by properly setting up appuser home directory and permissions. - Streamline directory structure and ownership for runtime operations. - Remove redundant `.apify` directory creation as it's handled by the CLI. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Improve shell script robustness and error handling The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include: - Added proper quoting around variables to prevent word splitting. - Improved error messages and logging functionality. - Implemented a cleanup trap to ensure temporary files are removed. - Enhanced validation of input parameters and output formats. - Added better handling of the log file and its storage. - Improved command execution with proper evaluation. - Added comments for better code readability and maintenance. - Fixed potential security issues with proper variable expansion. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Improve script logging and error handling - Initialize log file at `/tmp/docling.log` and redirect all output to it - Remove exit on error trap, now only logs error line numbers - Use temporary directory for timestamp file - Capture Docling exit code and handle errors more gracefully - Update log file references to use `LOG_FILE` variable - Remove local log file during cleanup Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Updating Docling to 2.17.0 Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding README Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: README update Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Enhance Dockerfile with additional utilities and env vars - Add installation of `time` and `procps` packages for better resource monitoring. - Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance. - Create a cache directory for EasyOCR to optimize storage usage. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: README update Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding the Apify FirstPromoter integration Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding the "Run on Apify" button Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Fixing example PDF document URLs Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Documentation update Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding input document URL validation Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Fix quoting in `DOC_CONVERT_CMD` variable Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Documentation update Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Add specific error codes for better error handling - `ERR_INVALID_INPUT` for missing document URL - `ERR_URL_INACCESSIBLE` for inaccessible URLs - `ERR_DOCLING_FAILED` for Docling command failures - `ERR_OUTPUT_MISSING` for missing or empty output files - `ERR_STORAGE_FAILED` for failures in storing the output document Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Enhance error handling and data logging - Add `apify pushData` calls to log errors when the document URL is missing or inaccessible. - Introduce dataset record creation with processing results, including a success status and output file URL. - Modify completion message to indicate successful processing and provide a link to the results. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Normalize key-value store terminology Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Enhance `README.md` with output details Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding CHANGELOG.md Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Adding dataset schema Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Update README with output URL details Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Fix the Apify call syntax and final result URL message Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Add section on Actors to README Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Replace Docling CLI with docling-serve API This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include: - Redesign Dockerfile to use docling-serve as base image - Update actor.sh to communicate with API instead of running CLI commands - Improve content type handling for various output formats - Update input schema to align with API parameters - Reduce Docker image size from ~6GB to ~600MB - Update documentation and changelog to reflect architectural changes The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities. Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Overhaul the implementation using official docling-serve image This commit completely revamps the Actor implementation with two major improvements: 1) CRITICAL CHANGE: Switch to official docling-serve image * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image * Eliminates need for custom docling installation * Ensures compatibility with latest docling-serve features * Provides more reliable and consistent document processing 2) Fix Apify Actor KVS storage issues: * Standardize key names to follow Apify conventions: - Change "OUTPUT_RESULT" to "OUTPUT" - Change "DOCLING_LOG" to "LOG" * Add proper multi-stage Docker build: - First stage builds dependencies including apify-cli - Second stage uses official image and adds only necessary tools * Fix permission issues in Docker container: - Set up proper user and directory permissions - Create writable directories for temporary files and models - Configure environment variables for proper execution 3) Solve EACCES permission errors during CLI version checks: * Create temporary HOME directory with proper write permissions * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable * Add NODE_OPTIONS="--no-warnings" to suppress update checks * Support --no-update-notifier CLI flag when available 4) Improve code organization and reliability: * Create reusable upload_to_kvs() function for all KVS operations * Ensure log files are uploaded before tools directory is removed * Set proper MIME types based on output format * Add detailed error reporting and proper cleanup * Display final output URLs for easy verification This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Refactor `actor.sh` and add `docling_processor.py` Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API. Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Update CHANGELOG and README for Docker and API changes Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Removing obsolete actor.json keys Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Fixed input getter Signed-off-by: Adam Kliment <adam@netmilk.net> * Actor: Always output a zip Signed-off-by: Adam Kliment <adam@netmilk.net> * Actor: Resolving conflicts with main Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Resolving conflicts with main (pass 2) Signed-off-by: Václav Vančura <commit@vancura.dev> * Actor: Updated main Readme and Actor Readme Signed-off-by: Adam Kliment <adam@netmilk.net> --------- Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Adam Kliment <adam@netmilk.net> Signed-off-by: Václav Vančura <commit@vancura.dev> Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Adam Kliment <adam@netmilk.net>
2.5 KiB
2.5 KiB
Changelog
All notable changes to the Docling Actor will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[1.1.0] - 2025-03-09
Changed
- Switched from full Docling CLI to docling-serve API
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
- Updated endpoint from custom
/convert
to standard/v1alpha/convert/source
- Revised JSON payload structure to match docling-serve API format
- Added proper output field parsing based on format
- Updated endpoint from custom
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses
Fixed
- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.
Technical Details
- Actor Specification v1
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process
[1.0.0] - 2025-02-07
Added
- Initial release of Docling Actor
- Support for multiple document formats (PDF, DOCX, images)
- OCR capabilities for scanned documents
- Multiple output formats (md, json, html, text, doctags)
- Comprehensive error handling and logging
- Dataset records with processing status
- Memory monitoring and resource optimization
- Security features including non-root user execution
Technical Details
- Actor Specification v1
- Docling v2.17.0
- Python 3.11
- Node.js 20.x
- Comprehensive error codes:
- 10: Invalid input
- 11: URL inaccessible
- 12: Docling processing failed
- 13: Output file missing
- 14: Storage operation failed
- 15: OCR processing failed