feat(actor): Docling Actor on Apify infrastructure (#875)
* fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)

  fix: Properly care for all bitmap elements in OCR

  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* chore: bump version to 2.15.1 [skip ci]

* Actor: Initial implementation

  Signed-off-by: Václav Vančura <commit@vancura.dev>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: .dockerignore update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Actor badge

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Moving the badge where it belongs

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Signed-off-by: Václav Vančura <commit@vancura.dev>
  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Switching Docker to python:3.11-slim-bookworm

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Docker security with proper user permissions

  - Set proper ownership and permissions for runtime directory.
  - Switch to non-root user for enhanced security.
  - Use `--chown` flag in COPY commands to maintain correct file ownership.
  - Ensure all files and directories are owned by `appuser`.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Optimize Dockerfile with security and size improvements

  - Combine RUN commands to reduce image layers and overall size.
  - Add non-root user `appuser` for improved security.
  - Use `--no-install-recommends` flag to minimize installed packages.
  - Install only necessary dependencies in a single RUN command.
  - Maintain proper cleanup of package lists and caches.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add Docker image metadata labels

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update dependencies with fixed versions

  Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix apify-cli version problem

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Create Apify user home directory in Docker setup

  Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update Docker configuration for improved security

  - Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
  - Improve readability with consistent formatting and spacing in RUN commands.
  - Enhance security by properly setting up appuser home directory and permissions.
  - Streamline directory structure and ownership for runtime operations.
  - Remove redundant `.apify` directory creation as it's handled by the CLI.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve shell script robustness and error handling

  The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

  - Added proper quoting around variables to prevent word splitting.
  - Improved error messages and logging functionality.
  - Implemented a cleanup trap to ensure temporary files are removed.
  - Enhanced validation of input parameters and output formats.
  - Added better handling of the log file and its storage.
  - Improved command execution with proper evaluation.
  - Added comments for better code readability and maintenance.
  - Fixed potential security issues with proper variable expansion.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve script logging and error handling

  - Initialize log file at `/tmp/docling.log` and redirect all output to it
  - Remove exit on error trap, now only logs error line numbers
  - Use temporary directory for timestamp file
  - Capture Docling exit code and handle errors more gracefully
  - Update log file references to use `LOG_FILE` variable
  - Remove local log file during cleanup

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updating Docling to 2.17.0

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding README

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Dockerfile with additional utilities and env vars

  - Add installation of `time` and `procps` packages for better resource monitoring.
  - Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
  - Create a cache directory for EasyOCR to optimize storage usage.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Apify FirstPromoter integration

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the "Run on Apify" button

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixing example PDF document URLs

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding input document URL validation

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix quoting in `DOC_CONVERT_CMD` variable

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

  Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add specific error codes for better error handling

  - `ERR_INVALID_INPUT` for missing document URL
  - `ERR_URL_INACCESSIBLE` for inaccessible URLs
  - `ERR_DOCLING_FAILED` for Docling command failures
  - `ERR_OUTPUT_MISSING` for missing or empty output files
  - `ERR_STORAGE_FAILED` for failures in storing the output document

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance error handling and data logging

  - Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
  - Introduce dataset record creation with processing results, including a success status and output file URL.
  - Modify completion message to indicate successful processing and provide a link to the results.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Normalize key-value store terminology

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance `README.md` with output details

  Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding CHANGELOG.md

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding dataset schema

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update README with output URL details

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix the Apify call syntax and final result URL message

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add section on Actors to README

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Replace Docling CLI with docling-serve API

  This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

  - Redesign Dockerfile to use docling-serve as base image
  - Update actor.sh to communicate with API instead of running CLI commands
  - Improve content type handling for various output formats
  - Update input schema to align with API parameters
  - Reduce Docker image size from ~6GB to ~600MB
  - Update documentation and changelog to reflect architectural changes

  The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

  Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Overhaul the implementation using official docling-serve image

  This commit completely revamps the Actor implementation with two major improvements:

  1) CRITICAL CHANGE: Switch to official docling-serve image
     * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
     * Eliminates need for custom docling installation
     * Ensures compatibility with latest docling-serve features
     * Provides more reliable and consistent document processing

  2) Fix Apify Actor KVS storage issues:
     * Standardize key names to follow Apify conventions:
       - Change "OUTPUT_RESULT" to "OUTPUT"
       - Change "DOCLING_LOG" to "LOG"
     * Add proper multi-stage Docker build:
       - First stage builds dependencies including apify-cli
       - Second stage uses official image and adds only necessary tools
     * Fix permission issues in Docker container:
       - Set up proper user and directory permissions
       - Create writable directories for temporary files and models
       - Configure environment variables for proper execution

  3) Solve EACCES permission errors during CLI version checks:
     * Create temporary HOME directory with proper write permissions
     * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
     * Add NODE_OPTIONS="--no-warnings" to suppress update checks
     * Support --no-update-notifier CLI flag when available

  4) Improve code organization and reliability:
     * Create reusable upload_to_kvs() function for all KVS operations
     * Ensure log files are uploaded before tools directory is removed
     * Set proper MIME types based on output format
     * Add detailed error reporting and proper cleanup
     * Display final output URLs for easy verification

  This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Refactor `actor.sh` and add `docling_processor.py`

  Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update CHANGELOG and README for Docker and API changes

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Removing obsolete actor.json keys

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixed input getter

  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Always output a zip

  Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Resolving conflicts with main

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Resolving conflicts with main (pass 2)

  Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updated main Readme and Actor Readme

  Signed-off-by: Adam Kliment <adam@netmilk.net>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
Signed-off-by: Václav Vančura <commit@vancura.dev>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adam Kliment <adam@netmilk.net>
This commit is contained in:
parent
7e01798417
commit
772487f9c9
11
.actor/.dockerignore
Normal file
@@ -0,0 +1,11 @@
**/__pycache__
**/*.pyc
**/*.pyo
**/*.pyd
.git
.gitignore
.env
.venv
*.log
.pytest_cache
.coverage
69
.actor/CHANGELOG.md
Normal file
@@ -0,0 +1,69 @@
# Changelog

All notable changes to the Docling Actor will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.1.0] - 2025-03-09

### Changed

- Switched from full Docling CLI to docling-serve API
- Using the official quay.io/ds4sd/docling-serve-cpu Docker image
- Reduced Docker image size (from ~6GB to ~4GB)
- Implemented multi-stage Docker build to handle dependencies
- Improved Docker build process to ensure compatibility with docling-serve-cpu image
- Added new Python processor script for reliable API communication and content extraction
- Enhanced response handling with better content extraction logic
- Fixed ES modules compatibility issue with Apify CLI
- Added explicit tmpfs volume for temporary files
- Fixed environment variables format in actor.json
- Created optimized dependency installation approach
- Improved API compatibility with docling-serve
  - Updated endpoint from custom `/convert` to standard `/v1alpha/convert/source`
  - Revised JSON payload structure to match docling-serve API format
  - Added proper output field parsing based on format
- Enhanced startup process with health checks
- Added configurable API host and port through environment variables
- Better content type handling for different output formats
- Updated error handling to align with API responses

### Fixed

- Fixed actor input file conflict in get_actor_input(): now checks for and removes an existing /tmp/actor-input/INPUT directory if found, ensuring valid JSON input parsing.

### Technical Details

- Actor Specification v1
- Using quay.io/ds4sd/docling-serve-cpu:latest base image
- Node.js 20.x for Apify CLI
- Eliminated Python dependencies
- Simplified Docker build process

## [1.0.0] - 2025-02-07

### Added

- Initial release of Docling Actor
- Support for multiple document formats (PDF, DOCX, images)
- OCR capabilities for scanned documents
- Multiple output formats (md, json, html, text, doctags)
- Comprehensive error handling and logging
- Dataset records with processing status
- Memory monitoring and resource optimization
- Security features including non-root user execution

### Technical Details

- Actor Specification v1
- Docling v2.17.0
- Python 3.11
- Node.js 20.x
- Comprehensive error codes:
  - 10: Invalid input
  - 11: URL inaccessible
  - 12: Docling processing failed
  - 13: Output file missing
  - 14: Storage operation failed
  - 15: OCR processing failed
87
.actor/Dockerfile
Normal file
@@ -0,0 +1,87 @@
# Build stage for installing dependencies
FROM node:20-slim AS builder

# Install necessary tools and prepare dependencies environment in one layer
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /build/bin /build/lib/node_modules \
    && cp /usr/local/bin/node /build/bin/
# Set working directory
WORKDIR /build

# Create package.json and install Apify CLI in one layer
RUN echo '{"name":"docling-actor-dependencies","version":"1.0.0","description":"Dependencies for Docling Actor","private":true,"type":"module","engines":{"node":">=18"}}' > package.json \
    && npm install apify-cli@latest \
    && cp -r node_modules/* lib/node_modules/ \
    && echo '#!/bin/sh\n/tmp/docling-tools/bin/node /tmp/docling-tools/lib/node_modules/apify-cli/bin/run "$@"' > bin/actor \
    && chmod +x bin/actor \
    # Clean up npm cache to reduce image size
    && npm cache clean --force

# Final stage with docling-serve-cpu
FROM quay.io/ds4sd/docling-serve-cpu:latest

LABEL maintainer="Vaclav Vancura <@vancura>" \
      description="Apify Actor for document processing using Docling" \
      version="1.1.0"

# Set only essential environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DOCLING_SERVE_HOST=0.0.0.0 \
    DOCLING_SERVE_PORT=5001

# Switch to root temporarily to set up directories and permissions
USER root
WORKDIR /app

# Install required tools and create directories in a single layer
RUN dnf install -y \
    jq \
    && dnf clean all \
    && mkdir -p /build-files \
    /tmp \
    /tmp/actor-input \
    /tmp/actor-output \
    /tmp/actor-storage \
    /tmp/apify_input \
    /apify_input \
    /opt/app-root/src/.EasyOCR/user_network \
    /tmp/easyocr-models \
    && chown 1000:1000 /build-files \
    && chown -R 1000:1000 /opt/app-root/src/.EasyOCR \
    && chmod 1777 /tmp \
    && chmod 1777 /tmp/easyocr-models \
    && chmod 777 /tmp/actor-input /tmp/actor-output /tmp/actor-storage /tmp/apify_input /apify_input \
    # Fix for uv_os_get_passwd error in Node.js
    && echo "docling:x:1000:1000:Docling User:/app:/bin/sh" >> /etc/passwd

# Set environment variable to tell EasyOCR to use a writable location for models
ENV EASYOCR_MODULE_PATH=/tmp/easyocr-models

# Copy only required files
COPY --chown=1000:1000 .actor/actor.sh .actor/actor.sh
COPY --chown=1000:1000 .actor/actor.json .actor/actor.json
COPY --chown=1000:1000 .actor/input_schema.json .actor/input_schema.json
COPY --chown=1000:1000 .actor/docling_processor.py .actor/docling_processor.py
RUN chmod +x .actor/actor.sh

# Copy the build files from builder
COPY --from=builder --chown=1000:1000 /build /build-files


# Switch to non-root user
USER 1000

# Set up TMPFS for temporary files
VOLUME ["/tmp"]

# Create additional volumes for OCR models persistence
VOLUME ["/tmp/easyocr-models"]

# Expose the docling-serve API port
EXPOSE 5001

# Run the actor script
ENTRYPOINT [".actor/actor.sh"]
314
.actor/README.md
Normal file
@@ -0,0 +1,314 @@
# Docling Actor on Apify

[Docling Actor](https://apify.com/vancura/docling)

This Actor (specification v1) wraps the [Docling project](https://ds4sd.github.io/docling/) to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

## What are Actors?

[Actors](https://docs.apify.com/platform/actors?fpr=docling) are serverless microservices running on the [Apify Platform](https://apify.com/?fpr=docling). They are based on the [Actor SDK](https://docs.apify.com/sdk/js?fpr=docling) and can be found in the [Apify Store](https://apify.com/store?fpr=docling). Learn more about Actors in the [Apify Whitepaper](https://whitepaper.actor?fpr=docling).

## Table of Contents

1. [Features](#features)
2. [Usage](#usage)
3. [Input Parameters](#input-parameters)
4. [Output](#output)
5. [Performance & Resources](#performance--resources)
6. [Troubleshooting](#troubleshooting)
7. [Local Development](#local-development)
8. [Architecture](#architecture)
9. [License](#license)
10. [Acknowledgments](#acknowledgments)
11. [Security Considerations](#security-considerations)

## Features

- Leverages the official docling-serve-cpu Docker image for efficient document processing
- Processes multiple document formats:
  - PDF documents (scanned or digital)
  - Microsoft Office files (DOCX, XLSX, PPTX)
  - Images (PNG, JPG, TIFF)
  - Other text-based formats
- Provides OCR capabilities for scanned documents
- Exports to multiple formats:
  - Markdown
  - JSON
  - HTML
  - Plain Text
  - DocTags (structured format)
- No local setup needed: just provide input via a simple JSON config

## Usage

### Using Apify Console

1. Go to the Apify Actor page.
2. Click "Run".
3. In the input form, fill in:
   - The URL of the document.
   - Output format (`md`, `json`, `html`, `text`, or `doctags`).
   - OCR boolean toggle.
4. The Actor will run and produce its outputs in the default key-value store under the key `OUTPUT`.

### Using Apify API

```bash
curl --request POST \
  --url "https://api.apify.com/v2/acts/vancura~docling/runs" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_TOKEN' \
  --data '{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
```
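The same request can be prepared from Python. The following is a minimal sketch using only the standard library; it assumes the Apify API's run-Actor endpoint has the form `/v2/acts/{actorId}/runs`, and the helper name `build_run_request` is hypothetical. It only builds the request object and does not send it.

```python
import json
import urllib.request

APIFY_API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run.

    `build_run_request` is an illustrative helper, not part of any Apify SDK.
    """
    url = f"{APIFY_API_BASE}/acts/{actor_id}/runs"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_run_request(
    "vancura~docling",
    "YOUR_API_TOKEN",
    {
        "options": {"to_formats": ["md", "json"]},
        "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
    },
)
# To actually start the run, pass the request to urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before spending any platform credits.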

### Using Apify CLI

```bash
apify call vancura/docling --input='{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
```

## Input Parameters

The Actor accepts JSON input that conforms to the schema in `.actor/input_schema.json`. Below is a summary of the fields:

| Field          | Type   | Required | Default | Description                                                                                               |
|----------------|--------|----------|---------|-----------------------------------------------------------------------------------------------------------|
| `http_sources` | array  | Yes      | None    | Source documents, each an object with a `url` key; see https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint |
| `options`      | object | No       | None    | Conversion options such as `to_formats`; see https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters |

### Example Input

```json
{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}
```
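Before starting a run, it can help to sanity-check the input locally. The sketch below is an assumption about what a reasonable pre-flight check might look like, based only on the example input shown here; it is not the Actor's own validation logic, and `validate_actor_input` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def validate_actor_input(actor_input: dict) -> list[str]:
    """Return a list of problems found in the input (empty when it looks valid)."""
    problems = []

    # `http_sources` is required and is expected to be a list of {"url": ...} objects.
    sources = actor_input.get("http_sources")
    if not isinstance(sources, list) or not sources:
        problems.append("http_sources must be a non-empty list")
        sources = []

    for index, source in enumerate(sources):
        url = source.get("url") if isinstance(source, dict) else None
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"http_sources[{index}] needs an http(s) 'url'")

    # `options` is optional, but when present it should be a JSON object.
    if not isinstance(actor_input.get("options", {}), dict):
        problems.append("options must be an object")

    return problems
```

An empty list means the structure matches the example above; it does not guarantee that the URLs are reachable or that the document format is supported.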

## Output

The Actor provides three types of outputs:

1. **Processed Documents in a ZIP** - The Actor will provide the direct URL to your result in the run log, looking like:

   ```text
   You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
   ```

2. **Processing Log** - Available in the key-value store as `DOCLING_LOG`

3. **Dataset Record** - Contains processing metadata with:
   - Direct link to the processed output zip file
   - Processing status

You can access the results in several ways:

1. **Direct URL** (shown in Actor run logs):

   ```text
   https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
   ```

2. **Programmatically** via Apify CLI:

   ```bash
   apify key-value-stores get-value OUTPUT
   ```

3. **Dataset** - Check the "Dataset" tab in the Actor run details to see processing metadata
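Once the `OUTPUT` record has been downloaded, the ZIP can be unpacked with the standard library. The following is a small sketch under stated assumptions: the helper names `record_url` and `read_zip_texts` are hypothetical, the URL pattern is the one shown above, and the archive is assumed to contain text files (the actual member names depend on the run).

```python
import io
import zipfile


def record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the direct URL of a key-value store record, following the pattern above."""
    return f"https://api.apify.com/v2/key-value-stores/{store_id}/records/{key}"


def read_zip_texts(zip_bytes: bytes) -> dict[str, str]:
    """Map each archive member name to its content decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8", errors="replace")
            for name in archive.namelist()
        }
```

For example, after fetching the bytes from `record_url(store_id)` with any HTTP client, `read_zip_texts(downloaded_bytes)` returns a dictionary keyed by file name.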

### Example Outputs

#### Markdown (md)

```markdown
# Document Title

## Section 1
Content of section 1...

## Section 2
Content of section 2...
```

#### JSON

```json
{
  "title": "Document Title",
  "sections": [
    {
      "level": 1,
      "title": "Section 1",
      "content": "Content of section 1..."
    }
  ]
}
```

#### HTML

```html
<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>
```

### Processing Logs (`DOCLING_LOG`)

The Actor maintains detailed processing logs including:

- API request and response details
- Processing steps and timing
- Error messages and stack traces
- Input validation results

Access logs via:

```bash
apify key-value-stores get-record DOCLING_LOG
```

## Performance & Resources

- **Docker Image Size**: ~4GB
- **Memory Requirements**:
  - Minimum: 2 GB RAM
  - Recommended: 4 GB RAM for large or complex documents
- **Processing Time**:
  - Simple documents: 15-30 seconds
  - Complex PDFs with OCR: 1-3 minutes
  - Large documents (100+ pages): 3-10 minutes

## Troubleshooting

Common issues and solutions:

1. **Document URL Not Accessible**
   - Ensure the URL is publicly accessible
   - Check if the document requires authentication
   - Verify the URL leads directly to the document

2. **OCR Processing Fails**
   - Verify the document is not password-protected
   - Check if the image quality is sufficient
   - Try processing with OCR disabled

3. **API Response Issues**
   - Check the logs for detailed error messages
   - Ensure the document format is supported
   - Verify the URL is correctly formatted

4. **Output Format Issues**
   - Verify the output format is supported
   - Check if the document structure is compatible
   - Review the `DOCLING_LOG` for specific errors

### Error Handling

The Actor implements comprehensive error handling:

- Detailed error messages in `DOCLING_LOG`
- Proper exit codes for different failure scenarios
- Automatic cleanup on failure
- Dataset records with processing status
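When calling the Actor from scripts, the exit codes can be mapped to readable names. The sketch below uses the numeric codes listed in the changelog for version 1.0.0 and the `ERR_*` names mentioned in the commit history; whether the current script still emits exactly these codes is an assumption, and `ERR_OCR_FAILED` is a hypothetical name, since only the description of code 15 is documented.

```python
from enum import IntEnum


class ActorExitCode(IntEnum):
    """Exit codes as documented for v1.0.0 of the Actor (assumed still current)."""

    ERR_INVALID_INPUT = 10     # Invalid input
    ERR_URL_INACCESSIBLE = 11  # URL inaccessible
    ERR_DOCLING_FAILED = 12    # Docling processing failed
    ERR_OUTPUT_MISSING = 13    # Output file missing
    ERR_STORAGE_FAILED = 14    # Storage operation failed
    ERR_OCR_FAILED = 15        # OCR processing failed (hypothetical name)


def describe_exit_code(code: int) -> str:
    """Translate a numeric exit code into a readable label."""
    if code == 0:
        return "SUCCESS"
    try:
        return ActorExitCode(code).name
    except ValueError:
        return f"UNKNOWN({code})"
```

A wrapper script could log `describe_exit_code(returncode)` next to the run ID to make failed runs easier to triage.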

## Local Development

If you wish to develop or modify this Actor locally:

1. Clone the repository.
2. Ensure Docker is installed.
3. The Actor files are located in the `.actor` directory:
   - `Dockerfile` - Defines the container environment
   - `actor.json` - Actor configuration and metadata
   - `actor.sh` - Main execution script that starts the docling-serve API and orchestrates document processing
   - `input_schema.json` - Input parameter definitions
   - `dataset_schema.json` - Dataset output format definition
   - `docling_processor.py` - Python script for API communication
   - `CHANGELOG.md` - Change log documenting all notable changes
   - `README.md` - This documentation
4. Run the Actor locally using:

   ```bash
   apify run
   ```

### Actor Structure

```text
.actor/
├── Dockerfile              # Container definition
├── actor.json              # Actor metadata
├── actor.sh                # Execution script (also starts docling-serve API)
├── input_schema.json       # Input parameters
├── dataset_schema.json     # Dataset output format definition
├── docling_processor.py    # Python script for API communication
├── CHANGELOG.md            # Version history and changes
└── README.md               # This documentation
```

## Architecture

This Actor uses a lightweight architecture based on the official `quay.io/ds4sd/docling-serve-cpu` Docker image:

- **Base Image**: `quay.io/ds4sd/docling-serve-cpu:latest` (~4GB)
- **Multi-Stage Build**: Uses a multi-stage Docker build to include only necessary tools
- **API Communication**: Uses the RESTful API provided by docling-serve
- **Request Flow**:
  1. The actor script starts the docling-serve API on port 5001
  2. Performs health checks to ensure the API is running
  3. Processes the input parameters
  4. Creates a JSON payload for the docling-serve API with proper format:

     ```json
     {
       "options": {
         "to_formats": ["md"],
         "do_ocr": true
       },
       "http_sources": [{"url": "https://example.com/document.pdf"}]
     }
     ```

  5. Makes a POST request to the `/v1alpha/convert/source` endpoint
  6. Processes the response and stores it in the key-value store
- **Dependencies**:
  - Node.js for Apify CLI, copied from the build stage
  - Essential tools (curl, jq, etc.); `jq` is installed in the final stage
- **Security**: Runs as a non-root user for enhanced security
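The request flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `docling_processor.py`: the helper names are hypothetical, and the health probe is passed in as a callable because the README does not name the health-check URL.

```python
import time
from typing import Callable, Iterable


def build_convert_payload(
    urls: Iterable[str],
    to_formats: Iterable[str] = ("md",),
    do_ocr: bool = True,
) -> dict:
    """Build the JSON payload for POST /v1alpha/convert/source (step 4)."""
    return {
        "options": {"to_formats": list(to_formats), "do_ocr": do_ocr},
        "http_sources": [{"url": url} for url in urls],
    }


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll `probe` until it reports success or the attempts run out (step 2).

    `probe` is supplied by the caller; it might, for instance, issue an HTTP
    request against the API on port 5001 and return True on a 200 response.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Injecting the probe keeps the retry loop independent of any particular HTTP client and makes it straightforward to exercise without a running server.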

## License

This wrapper project is under the MIT License, matching the original Docling license. See [LICENSE](../LICENSE) for details.

## Acknowledgments

- [Docling](https://ds4sd.github.io/docling/) and [docling-serve-cpu](https://quay.io/repository/ds4sd/docling-serve-cpu) by IBM
- [Apify](https://apify.com/?fpr=docling) for the serverless actor environment

## Security Considerations

- Actor runs under a non-root user for enhanced security
- Input URLs are validated before processing
- Temporary files are securely managed and cleaned up
- Process isolation through Docker containerization
- Secure handling of processing artifacts
11
.actor/actor.json
Normal file
@@ -0,0 +1,11 @@
{
    "actorSpecification": 1,
    "name": "docling",
    "version": "0.0",
    "environmentVariables": {},
    "dockerFile": "./Dockerfile",
    "input": "./input_schema.json",
    "scripts": {
        "run": "./actor.sh"
    }
}
419
.actor/actor.sh
Executable file
@@ -0,0 +1,419 @@
#!/bin/bash

export PATH=$PATH:/build-files/node_modules/.bin

# Function to upload content to the key-value store
upload_to_kvs() {
    local content_file="$1"
    local key_name="$2"
    local content_type="$3"
    local description="$4"

    # Find the Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Uploading $description to key-value store (key: $key_name)..."

        # Create a temporary home directory with write permissions
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        if $apify_cmd --help | grep -q "\--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        else
            # Fall back to regular command if flag isn't available
            if $apify_cmd actor:set-value "$key_name" --contentType "$content_type" < "$content_file"; then
                echo "Successfully uploaded $description to key-value store"
                local url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/$key_name"
                echo "$description available at: $url"
                cleanup_temp_environment
                return 0
            fi
        fi

        echo "ERROR: Failed to upload $description to key-value store"
        cleanup_temp_environment
        return 1
    else
        echo "ERROR: Apify CLI not found for $description upload"
        return 1
    fi
}

# Function to find Apify CLI command
find_apify_cmd() {
    FOUND_APIFY_CMD=""
    for cmd in "apify" "actor" "/usr/local/bin/apify" "/usr/bin/apify" "/opt/apify/cli/bin/apify"; do
        if command -v "$cmd" &> /dev/null; then
            FOUND_APIFY_CMD="$cmd"
            break
        fi
    done
}

# Function to set up temporary environment for Apify CLI
setup_temp_environment() {
    export TMPDIR="/tmp/apify-home-${RANDOM}"
    mkdir -p "$TMPDIR"
    export APIFY_DISABLE_VERSION_CHECK=1
    export NODE_OPTIONS="--no-warnings"
    export HOME="$TMPDIR"  # Override home directory to writable location
}

# Function to clean up temporary environment
cleanup_temp_environment() {
    rm -rf "$TMPDIR" 2>/dev/null || true
}

# Function to push data to Apify dataset
push_to_dataset() {
    # Example usage: push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"

    local result_url="$1"
    local size="$2"
    local format="$3"

    # Find Apify CLI command
    find_apify_cmd
    local apify_cmd="$FOUND_APIFY_CMD"

    if [ -n "$apify_cmd" ]; then
        echo "Adding record to dataset..."
        setup_temp_environment

        # Use the --no-update-notifier flag if available
        if $apify_cmd --help | grep -q "\--no-update-notifier"; then
            if $apify_cmd --no-update-notifier actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        else
            # Fall back to regular command
            if $apify_cmd actor:push-data "{\"output_file\": \"${result_url}\", \"format\": \"${format}\", \"size\": \"${size}\", \"status\": \"success\"}"; then
                echo "Successfully added record to dataset"
            else
                echo "Warning: Failed to add record to dataset"
            fi
        fi

        cleanup_temp_environment
    fi
}


# --- Setup logging and error handling ---

LOG_FILE="/tmp/docling.log"
touch "$LOG_FILE" || {
    echo "Fatal: Cannot create log file at $LOG_FILE"
    exit 1
}

# Log to both console and file
exec 1> >(tee -a "$LOG_FILE")
exec 2> >(tee -a "$LOG_FILE" >&2)

# Exit codes
readonly ERR_API_UNAVAILABLE=15
readonly ERR_INVALID_INPUT=16


# --- Debug environment ---

echo "Date: $(date)"
echo "Python version: $(python --version 2>&1)"
echo "Docling-serve path: $(which docling-serve 2>/dev/null || echo 'Not found')"
echo "Working directory: $(pwd)"

# --- Get input ---

echo "Getting Apify Actor Input"
INPUT=$(apify actor get-input 2>/dev/null)

# --- Setup tools ---

echo "Setting up tools..."
TOOLS_DIR="/tmp/docling-tools"
mkdir -p "$TOOLS_DIR"

# Copy tools if available
if [ -d "/build-files" ]; then
    echo "Copying tools from /build-files..."
    cp -r /build-files/* "$TOOLS_DIR/"
    export PATH="$TOOLS_DIR/bin:$PATH"
else
    echo "Warning: No build files directory found. Some tools may be unavailable."
fi

# Copy Python processor script to tools directory
PYTHON_SCRIPT_PATH="$(dirname "$0")/docling_processor.py"
if [ -f "$PYTHON_SCRIPT_PATH" ]; then
    echo "Copying Python processor script to tools directory..."
    cp "$PYTHON_SCRIPT_PATH" "$TOOLS_DIR/"
    chmod +x "$TOOLS_DIR/docling_processor.py"
else
    echo "ERROR: Python processor script not found at $PYTHON_SCRIPT_PATH"
    exit 1
fi

# Check OCR directories and ensure they're writable
echo "Checking OCR directory permissions..."
OCR_DIR="/opt/app-root/src/.EasyOCR"
if [ -d "$OCR_DIR" ]; then
    # Test if we can write to the directory
    if touch "$OCR_DIR/test_write" 2>/dev/null; then
        echo "[✓] OCR directory is writable"
        rm "$OCR_DIR/test_write"
    else
        echo "[✗] OCR directory is not writable, setting up alternative in /tmp"

        # Create alternative in /tmp (which is writable)
        mkdir -p "/tmp/.EasyOCR/user_network"
        export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
    fi
else
    echo "OCR directory not found, creating in /tmp"
    mkdir -p "/tmp/.EasyOCR/user_network"
    export EASYOCR_MODULE_PATH="/tmp/.EasyOCR"
fi


# --- Starting the API ---

echo "Starting docling-serve API..."

# Create a dedicated working directory in /tmp (writable)
API_DIR="/tmp/docling-api"
mkdir -p "$API_DIR"
cd "$API_DIR"
echo "API working directory: $(pwd)"

# Find docling-serve executable
DOCLING_SERVE_PATH=$(which docling-serve)
echo "Docling-serve executable: $DOCLING_SERVE_PATH"

# Start the API with minimal parameters to avoid any issues
echo "Starting docling-serve API..."
"$DOCLING_SERVE_PATH" run --host 0.0.0.0 --port 5001 > "$API_DIR/docling-serve.log" 2>&1 &
API_PID=$!
echo "Started docling-serve API with PID: $API_PID"

# A more reliable wait for API startup
echo "Waiting for API to initialize..."
MAX_TRIES=30
tries=0
started=false

while [ $tries -lt $MAX_TRIES ]; do
    tries=$((tries + 1))

    # Check if process is still running
    if ! ps -p $API_PID > /dev/null; then
        echo "ERROR: docling-serve API process terminated unexpectedly after $tries seconds"
        break
    fi

    # Check log for startup completion or errors
    if grep -q "Application startup complete" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "[✓] API startup completed successfully after $tries seconds"
        started=true
        break
    fi

    if grep -q "Permission denied\|PermissionError" "$API_DIR/docling-serve.log" 2>/dev/null; then
        echo "ERROR: Permission errors detected in API startup"
        break
    fi

    # Sleep and check again
    sleep 1

    # Output a progress indicator every 5 seconds
    if [ $((tries % 5)) -eq 0 ]; then
        echo "Still waiting for API startup... ($tries/$MAX_TRIES seconds)"
    fi
done

# Show log content regardless of outcome
echo "docling-serve log output so far:"
tail -n 20 "$API_DIR/docling-serve.log"

# Verify the API is running
if ! ps -p $API_PID > /dev/null; then
    echo "ERROR: docling-serve API failed to start"
    if [ -f "$API_DIR/docling-serve.log" ]; then
        echo "Full log output:"
        cat "$API_DIR/docling-serve.log"
    fi
    exit $ERR_API_UNAVAILABLE
fi

if [ "$started" != "true" ]; then
    echo "WARNING: API process is running but startup completion was not detected"
    echo "Will attempt to continue anyway..."
fi

# Try to verify API is responding at this point
echo "Verifying API responsiveness..."
(python -c "
import sys, time, socket
for i in range(5):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        result = s.connect_ex(('localhost', 5001))
        if result == 0:
            s.close()
            print('Port 5001 is open and accepting connections')
            sys.exit(0)
        s.close()
    except Exception as e:
        pass
    time.sleep(1)
print('Could not connect to API port after 5 attempts')
sys.exit(1)
" && echo "API verification succeeded") || echo "API verification failed, but continuing anyway"

# Define API endpoint
DOCLING_API_ENDPOINT="http://localhost:5001/v1alpha/convert/source"


# --- Processing document ---

echo "Starting document processing..."
echo "Reading input from Apify..."

echo "Input content:" >&2
echo "$INPUT" >&2  # Send the raw input to stderr for debugging
echo "$INPUT"      # Send the clean JSON to stdout for processing

# Create the request JSON
# ($INPUT is quoted so the shell does not word-split or glob the JSON)

REQUEST_JSON=$(echo "$INPUT" | jq '.options += {"return_as_file": true}')

echo "Creating request JSON:" >&2
echo "$REQUEST_JSON" >&2
echo "$REQUEST_JSON" > "$API_DIR/request.json"


# Send the conversion request using our Python script
#echo "Sending conversion request to docling-serve API..."
#python "$TOOLS_DIR/docling_processor.py" \
#    --api-endpoint "$DOCLING_API_ENDPOINT" \
#    --request-json "$API_DIR/request.json" \
#    --output-dir "$API_DIR" \
#    --output-format "$OUTPUT_FORMAT"

echo "Curl the Docling API"
curl -s -H "content-type: application/json" -X POST --data-binary "@$API_DIR/request.json" -o "$API_DIR/output.zip" "$DOCLING_API_ENDPOINT"

CURL_EXIT_CODE=$?

# --- Check for various potential output files ---

echo "Checking for output files..."
if [ -f "$API_DIR/output.zip" ]; then
    echo "Conversion completed successfully! Output file found."

    # Get content from the converted file
    OUTPUT_SIZE=$(wc -c < "$API_DIR/output.zip")
    echo "Output file found with size: $OUTPUT_SIZE bytes"

    # Calculate the access URL for result display
    RESULT_URL="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/OUTPUT"

    echo "=============================="
    echo "PROCESSING COMPLETE!"
    echo "Output size: ${OUTPUT_SIZE} bytes"
    echo "=============================="

    # Set the output content type based on format
    CONTENT_TYPE="application/zip"

    # Upload the document content using our function
    upload_to_kvs "$API_DIR/output.zip" "OUTPUT" "$CONTENT_TYPE" "Document content"

    # Only proceed with dataset record if document upload succeeded
    if [ $? -eq 0 ]; then
        echo "Your document is available at: ${RESULT_URL}"
        echo "=============================="

        # Push data to dataset
        push_to_dataset "$RESULT_URL" "$OUTPUT_SIZE" "zip"
    fi
else
    echo "ERROR: No converted output file found at $API_DIR/output.zip"

    # Create error metadata
    # NOTE: DOCUMENT_URL is never set in this script, so documentUrl is empty.
    ERROR_METADATA="{\"status\":\"error\",\"error\":\"No converted output file found\",\"documentUrl\":\"$DOCUMENT_URL\"}"
    # The output directory is not created anywhere else, so create it here
    mkdir -p "/tmp/actor-output"
    echo "$ERROR_METADATA" > "/tmp/actor-output/OUTPUT"
    chmod 644 "/tmp/actor-output/OUTPUT"

    echo "Error information has been saved to /tmp/actor-output/OUTPUT"
fi


# --- Verify output files for debugging ---

echo "=== Final Output Verification ==="
echo "Files in /tmp/actor-output:"
ls -la /tmp/actor-output/ 2>/dev/null || echo "Cannot list /tmp/actor-output/"

echo "All operations completed. The output should be available in the default key-value store."
echo "Content URL: ${RESULT_URL:-No URL available}"


# --- Cleanup function ---

cleanup() {
    echo "Running cleanup..."

    # Stop the API process
    if [ -n "$API_PID" ]; then
        echo "Stopping docling-serve API (PID: $API_PID)..."
        kill $API_PID 2>/dev/null || true
    fi

    # Export log file to KVS if it exists
    # DO THIS BEFORE REMOVING TOOLS DIRECTORY
    if [ -f "$LOG_FILE" ]; then
        if [ -s "$LOG_FILE" ]; then
            echo "Log file is not empty, pushing to key-value store (key: LOG)..."

            # Upload log using our function
            upload_to_kvs "$LOG_FILE" "LOG" "text/plain" "Log file"
        else
            echo "Warning: log file exists but is empty"
        fi
    else
        echo "Warning: No log file found"
    fi

    # Clean up temporary files AFTER log is uploaded
    echo "Cleaning up temporary files..."
    if [ -d "$API_DIR" ]; then
        echo "Removing API working directory: $API_DIR"
        rm -rf "$API_DIR" 2>/dev/null || echo "Warning: Failed to remove $API_DIR"
    fi

    if [ -d "$TOOLS_DIR" ]; then
        echo "Removing tools directory: $TOOLS_DIR"
        rm -rf "$TOOLS_DIR" 2>/dev/null || echo "Warning: Failed to remove $TOOLS_DIR"
    fi

    # Keep log file until the very end
    echo "Script execution completed at $(date)"
    echo "Actor execution completed"
}

# Register cleanup
trap cleanup EXIT
|
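Review note on `.actor/actor.sh`: the script registers `trap cleanup EXIT` as its last statement, so an early `exit` (for example `exit $ERR_API_UNAVAILABLE`) happens before the handler is installed, and the temporary home directory name is derived from `$RANDOM`. The following is a minimal sketch of an alternative ordering, not the author's implementation. The names `run_job` and `work_dir` are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Sketch only: run_job and work_dir are illustrative names, not part of actor.sh.

run_job() (
    # The parentheses run the body in a subshell, so the EXIT trap below
    # fires when this function finishes, however it finishes.

    # mktemp -d creates a uniquely named directory atomically.
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/docling-sketch.XXXXXX") || exit 1

    # Register cleanup before any step that might fail.
    trap 'rm -rf "$work_dir"' EXIT

    echo '{"options": {}}' > "$work_dir/request.json"
    echo "$work_dir"

    # Simulate an early failure, like "exit $ERR_API_UNAVAILABLE".
    exit 15
)

if used_dir=$(run_job); then
    status=0
else
    status=$?
fi

echo "run_job exited with status $status"
if [ ! -d "$used_dir" ]; then
    echo "scratch directory was removed by the trap"
fi
```

Installing the trap right after the directory is created means the cleanup runs on the failure path as well as the success path.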
31
.actor/dataset_schema.json
Normal file
@@ -0,0 +1,31 @@
{
    "title": "Docling Actor Dataset",
    "description": "Records of document processing results from the Docling Actor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "Document URL",
            "type": "string",
            "description": "URL of the processed document"
        },
        "output_file": {
            "title": "Result URL",
            "type": "string",
            "description": "Direct URL to the processed result in key-value store"
        },
        "status": {
            "title": "Processing Status",
            "type": "string",
            "description": "Status of the document processing",
            "enum": ["success", "error"]
        },
        "error": {
            "title": "Error Details",
            "type": "string",
            "description": "Error message if processing failed",
            "optional": true
        }
    },
    "required": ["url", "output_file", "status"]
}
|
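Review note on `.actor/dataset_schema.json`: the schema lists `url`, `output_file`, and `status` as required, while `push_to_dataset` in `actor.sh` sends `output_file`, `format`, `size`, and `status` and builds the JSON by string interpolation. Below is a hedged sketch of how a record with the three required fields could be assembled with basic escaping. `json_escape` and `build_record` are hypothetical helpers, and the example URLs are placeholders.

```shell
#!/bin/bash
# Sketch only: json_escape and build_record are illustrative helpers.

# Escape backslashes and double quotes so a value is safe inside a JSON string.
# Control characters (newlines, tabs) are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a record containing the three fields the dataset schema requires.
build_record() {
    local url output_file status
    url=$(json_escape "$1")
    output_file=$(json_escape "$2")
    status=$(json_escape "$3")
    printf '{"url": "%s", "output_file": "%s", "status": "%s"}\n' \
        "$url" "$output_file" "$status"
}

# Placeholder values; a real run would pass the source URL and result URL.
build_record "https://example.com/a.pdf" "https://example.com/OUTPUT" "success"
```

The resulting string could be handed to the same `actor:push-data` call the script already uses. Since the script already depends on `jq`, building the object with `jq` would be another reasonable option.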
27
.actor/input_schema.json
Normal file
@@ -0,0 +1,27 @@
{
    "title": "Docling Actor Input",
    "description": "Options for processing documents with Docling via the docling-serve API.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "http_sources": {
            "title": "Document URLs",
            "type": "array",
            "description": "URLs of documents to process. Supported formats: PDF, DOCX, PPTX, XLSX, HTML, MD, XML, images, and more.",
            "editor": "json",
            "prefill": [
                { "url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf" }
            ]
        },
        "options": {
            "title": "Processing Options",
            "type": "object",
            "description": "Document processing configuration options",
            "editor": "json",
            "prefill": {
                "to_formats": ["md"]
            }
        }
    },
    "required": ["options", "http_sources"]
}
|
27
README.md
@@ -21,6 +21,7 @@
[](https://github.com/pre-commit/pre-commit)
[](https://opensource.org/licenses/MIT)
[](https://pepy.tech/projects/docling)
[](https://apify.com/vancura/docling)

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

@@ -85,6 +86,32 @@ To further accelerate your AI application development, check out Docling's nativ
[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
and tools.

## Apify Actor

<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>

You can run Docling in the cloud without installation using the [Docling Actor](https://apify.com/vancura/docling?fpr=docling) on the Apify platform. Provide one or more document URLs and get the processed result:

```bash
apify call vancura/docling -i '{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
```

The Actor stores its results in:

* The key-value store, which holds the processed document as a ZIP archive (key `OUTPUT`, as uploaded by `.actor/actor.sh`)
* The key-value store, which also holds the processing log (key `LOG`)
* The dataset, which receives a record with the result URL and status

Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.
|
||||
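
Review note on the README section: `actor.sh` prints the location of each stored record using the pattern `https://api.apify.com/v2/key-value-stores/<store id>/records/<key>`. As a rough sketch, assuming that pattern and the `OUTPUT` key the script uploads under, a result URL could be composed as shown below. `kvs_record_url` is a hypothetical helper and `abc123` is a placeholder store ID.

```shell
#!/bin/bash
# Sketch only: kvs_record_url is an illustrative helper, not part of the Actor.

# Compose the record URL from a store ID and a key, following the pattern
# that actor.sh echoes after a successful upload.
kvs_record_url() {
    local store_id="$1"
    local key="$2"
    printf 'https://api.apify.com/v2/key-value-stores/%s/records/%s\n' "$store_id" "$key"
}

# "abc123" is a placeholder store ID; a real run reports its own.
kvs_record_url "abc123" "OUTPUT"

# Downloading the archive would then look like this (left commented out
# because it needs network access and a real store ID):
# curl -fsSL -o output.zip "$(kvs_record_url "$STORE_ID" "OUTPUT")"
```

Inside a run, the store ID comes from the `APIFY_DEFAULT_KEY_VALUE_STORE_ID` environment variable that the script reads.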

## Get help and support

Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).