Commit Graph

560 Commits

Author SHA1 Message Date
Cesar Berrospi Ramis
d5f7798763
test(html): fix regression test after docling-core update (#1197)
Update docling-core dependency to version 2.23.3.
Fix regression test of HTML backend after docling-core dependency update.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-19 11:03:46 +01:00
Rafael Teixeira de Lima
0b707d0882
fix(msword): Fixing function return in equations handling (#1194)
* Fixing function return

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Add message

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
2025-03-19 10:34:25 +01:00
Michele Dolfi
1d680b0a32
docs: Linux Foundation AI & Data (#1183)
* point the auxiliary files to the community repo and add lfai in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs index

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-19 09:05:57 +01:00
Michele Dolfi
54a78c307d
docs: move apify to docs (#1182)
move apify to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-18 16:43:55 +01:00
Maxim Lysak
2f72167ff6
feat: updated vlm pipeline (with latest changes from docling-core) (#1158)
* Draft implementation of Doctag backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated VLM pipeline doctags to docling conversion, now properly supports lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* preparing to migrate to new doctags deserializer

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* re-using DocTagsDocument.from_doctags_and_image_pairs

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* satisfying mypy and other checks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added support for force_backend_text parameter

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed unnecessary transformation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated readme

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-18 15:44:51 +01:00
github-actions[bot]
1a2a9e4eff
chore: bump version to 2.27.0 [skip ci]
2025-03-18 13:37:45 +00:00
Michele Dolfi
6eaae3cba0
feat: add factory for ocr engines via plugins (#1010)
* add factory for ocr engines

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* apply pre-commit after rebase

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add picture description factory

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix enable option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* switch to create methods

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* make `options` an explicit kwarg

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* keep old lock of docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add allow_external_plugins option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add factory return and ignore options type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-18 13:58:05 +01:00
Christoph Auer
3960b199d6
feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)
* Add DoclingParseV3 backend implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use docling-core with docling-parse types

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes and test updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix streams

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update test units

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back DoclingParse v1 backend, pipeline options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update locks

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Ground-truth files updated

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests, use TextCell.from_ocr property

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Text fixes, new test data

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename docling backend to v4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Test all backends, fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset all tests to use docling-parse v1 for now

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for DPv4 backend init, better test coverage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* test_input_doc use default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-18 10:38:19 +01:00
Václav Vančura
772487f9c9
feat(actor): Docling Actor on Apify infrastructure (#875)
* fix: Improve OCR results, stricten criteria before dropping bitmap areas (#719)

fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* chore: bump version to 2.15.1 [skip ci]

* Actor: Initial implementation

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: .dockerignore update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Actor badge

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Moving the badge where it belongs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Switching Docker to python:3.11-slim-bookworm

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Docker security with proper user permissions

- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Optimize Dockerfile with security and size improvements

- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add Docker image metadata labels

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update dependencies with fixed versions

Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix apify-cli version problem

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Create Apify user home directory in Docker setup

Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update Docker configuration for improved security

- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve shell script robustness and error handling

The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve script logging and error handling

- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updating Docling to 2.17.0

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Dockerfile with additional utilities and env vars

- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Apify FirstPromoter integration

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the "Run on Apify" button

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixing example PDF document URLs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding input document URL validation

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix quoting in `DOC_CONVERT_CMD` variable

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add specific error codes for better error handling

- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance error handling and data logging

- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Normalize key-value store terminology

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance `README.md` with output details

Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding CHANGELOG.md

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding dataset schema

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update README with output URL details

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix the Apify call syntax and final result URL message

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add section on Actors to README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Replace Docling CLI with docling-serve API

This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Overhaul the implementation using official docling-serve image

This commit completely revamps the Actor implementation with four major improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing

2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution

3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available

4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Refactor `actor.sh` and add `docling_processor.py`

Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update CHANGELOG and README for Docker and API changes

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Removing obsolete actor.json keys

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixed input getter

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Always output a zip

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Resolving conflicts with main

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Resolving conflicts with main (pass 2)

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updated main Readme and Actor Readme

Signed-off-by: Adam Kliment <adam@netmilk.net>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
Signed-off-by: Václav Vančura <commit@vancura.dev>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adam Kliment <adam@netmilk.net>
2025-03-18 10:17:44 +01:00
serced
7e01798417
docs: fix spelling of picture in usage (#1165)
Signed-off-by: serced <52759935+serced@users.noreply.github.com>
2025-03-17 09:33:51 +01:00
Michele Dolfi
fa16b12316
chore: move to docling-project org (#1160)
* chore: rename org

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update docs/faq/index.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* update github pages

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* revert test content

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-03-14 12:35:29 +01:00
Cesar Berrospi Ramis
f94da44ec5
fix(html): handle nested empty lists (#1154)
Address the case of nested lists in empty list items.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 16:56:58 +01:00
Panos Vagenas
0945973b79
fix: use first table row as col headers (#1156)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-03-13 15:34:18 +01:00
Rafael Teixeira de Lima
6eb718f849
feat: equations to latex in MSWord backend (with inline groups) (#1114)
* Equation groups

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Proper handling of orphan IDs in layout postprocessing (#1118)

* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.25.2 [skip ci]

* docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)

add env var in docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(CLI): fix help message for abort options (#1130)

fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* perf: New revision code formula model and document picture classifier (#1140)

* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: bump version to 2.26.0 [skip ci]

* fix: Pass tests, update docling-core to 2.22.0 (#1150)

fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Updating content hash

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Matteo <43417658+Matteo-Omenetti@users.noreply.github.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 15:12:22 +01:00
Cesar Berrospi Ramis
aa92a57fa9
fix: Pass tests, update docling-core to 2.22.0 (#1150)
fix: update docling-core to 2.22.0

Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-13 09:45:55 +01:00
github-actions[bot]
17c5bf1242
chore: bump version to 2.26.0 [skip ci]
2025-03-11 11:12:43 +00:00
Christoph Auer
eb97357b05
feat: Use new TableFormer model weights and default to accurate model version (#1100)
* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-11 10:53:49 +01:00
Matteo
5e30381c0d
perf: New revision code formula model and document picture classifier (#1140)
* new version code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new version document picture classifier

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* new code formula model

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* restored original code formula test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
2025-03-11 10:15:28 +01:00
Michele Dolfi
4d64c4c0b6
fix(CLI): fix help message for abort options (#1130)
fix help message

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-07 14:47:49 +01:00
Michele Dolfi
e1c49ad727
docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124)
add env var in docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-06 07:30:07 +01:00
github-actions[bot]
a3c957ca6b
chore: bump version to 2.25.2 [skip ci]
2025-03-05 14:51:57 +00:00
Christoph Auer
c56ab3a66b
fix: Proper handling of orphan IDs in layout postprocessing (#1118)
* Fix the handling of orphan IDs in layout postprocessing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-05 14:30:59 +01:00
Michele Dolfi
357d41cc47
docs: Enrichment models (#1097)
* warning for develop examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for enrichment models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* minor reorg of top-level docs (#1098)

* minor reorg of top-level docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix typo [no ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* trigger ci

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-03-04 14:24:38 +01:00
github-actions[bot]
b1e79cadc7
chore: bump version to 2.25.1 [skip ci]
2025-03-03 00:56:40 +00:00
Michele Dolfi
0c1e9391de
chore: use gh cache for huggingface models (#1096)
* use gh cache for huggingface models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* increase hf timeout

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more timeout

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use different cache key in each job

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-03 00:13:47 +01:00
Michele Dolfi
8dc0562542
fix: enable locks for threadsafe pdfium (#1052)
* enable locks for threadsafe pdfium

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix deadlock in pypdfium2 backend

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-02 20:06:44 +01:00
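
The locking idea named in the commit above can be illustrated in general terms. This is a minimal sketch, assuming a single module-level lock that serializes every call into a resource that is not thread-safe. It is not the code from the pypdfium2 backend; `_library_lock`, `GuardedCounter`, and `run_workers` are hypothetical names invented for this illustration.

```python
import threading

# Hypothetical module-level lock shared by all callers. It stands in for a
# lock that guards calls into a native library that is not thread-safe.
_library_lock = threading.Lock()


class GuardedCounter:
    """Hypothetical stand-in for a resource that is unsafe to use concurrently."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # The read-modify-write below is only safe because the lock is held.
        with _library_lock:
            current = self.value
            self.value = current + 1


def run_workers(n_threads=8, n_calls=1000):
    """Run n_threads threads that each call increment() n_calls times."""
    counter = GuardedCounter()

    def work():
        for _ in range(n_calls):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

One general caveat with this pattern: `threading.Lock` is not re-entrant, so a guarded function that calls another function guarded by the same lock will block forever. `threading.RLock` is the standard-library re-entrant alternative. Whether this is the deadlock the commit refers to is not stated in the log.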
Peter W. J. Staar
e25d557c06
refactor: add the contentlayer to html-backend (#1040)
* added the contentlayer to html-backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the handle_image function

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code of html backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* test(html): add more info if a test case fails

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor(html): put parsed item in body if doc has no header

In case an HTML does not have any header tag, all parsed items are placed in
DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth according to the changes above.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: set TextItem label to 'text' instead of 'paragraph'

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-02 10:37:53 -05:00
Panos Vagenas
db3ceefd4a
docs: improve docs on token limit warning triggered by HybridChunker (#1077)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-02-28 14:54:46 +01:00
Cesar Berrospi Ramis
de7b963b09
fix(html): use 'start' attribute when parsing ordered lists from HTML docs (#1062)
* fix(html): use 'start' attribute in ordered lists

When parsing ordered lists in HTML, take into account the 'start' attribute if it exists.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore(html): reduce verbosity in HTML backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-27 09:46:57 +01:00
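
The behaviour described in the commit above, numbering ordered-list items from the `start` attribute when it is present, can be sketched as follows. This is a minimal illustration using only Python's standard-library `html.parser`, not the docling HTML backend itself. `OrderedListParser` and `number_items` are hypothetical names, and falling back to 1 when `start` is missing or not an integer is an assumption of this sketch. It handles flat lists only.

```python
from html.parser import HTMLParser


class OrderedListParser(HTMLParser):
    """Hypothetical helper: numbers <li> items, honoring <ol start="...">."""

    def __init__(self):
        super().__init__()
        self.counters = []  # one running counter per open <ol>
        self.items = []     # collected [number, text] pairs
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            start = dict(attrs).get("start")
            # Assumption: fall back to 1 when 'start' is absent or invalid.
            try:
                first = int(start) if start is not None else 1
            except ValueError:
                first = 1
            self.counters.append(first)
        elif tag == "li" and self.counters:
            self._in_li = True
            self.items.append([self.counters[-1], ""])
            self.counters[-1] += 1

    def handle_endtag(self, tag):
        if tag == "ol" and self.counters:
            self.counters.pop()
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Attach text to the most recently opened list item.
        if self._in_li and self.items:
            self.items[-1][1] += data.strip()


def number_items(html):
    """Return (number, text) pairs for the <li> items of ordered lists."""
    parser = OrderedListParser()
    parser.feed(html)
    parser.close()
    return [(number, text) for number, text in parser.items]
```

For example, `number_items('<ol start="3"><li>c</li><li>d</li></ol>')` numbers the two items 3 and 4 rather than 1 and 2.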
github-actions[bot]
37dd8c1cc7
chore: bump version to 2.25.0 [skip ci]
2025-02-26 14:16:15 +00:00
Christoph Auer
3c9fe76b70
feat: [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054)
* Skeleton for SmolDocling model and VLM Pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* wip smolDocling inference and vlm pipeline

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* WIP, first working code for inference of SmolDocling, and vlm pipeline assembly code, example included.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes to preserve page image and demo export to html

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Enabled figure support in vlm_pipeline

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix for table span compute in vlm_pipeline

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Properly propagating image data per page, together with predicted tags in VLM pipeline. This enables correct figure extraction and page numbers in provenances

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up logs, added pages to vlm_pipeline, basic timing per page measurement in smol_docling models

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Replaced hardcoded otsl tokens with the ones from docling-core tokens.py enum

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added tokens/sec measurement, improved example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added capability for vlm_pipeline to grab text from preconfigured backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Exposed "force_backend_text" as pipeline parameter

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Flipped keep_backend to True for vlm_pipeline assembly to work

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated vlm pipeline assembly and smol docling model code to support updated doctags

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixing doctags starting tag, that broke elements on first line during assembly

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Introduced SmolDoclingOptions to configure model parameters (such as query and artifacts path) via client code, see example in minimal_smol_docling. Provisioning for other potential vlm all-in-one models.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Moved artifacts_path for SmolDocling into vlm_options instead of global pipeline option

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* New assembly code for latest model revision, updated prompt and parsing of doctags, updated logging

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated example of Smol Docling usage

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added captions for the images for SmolDocling assembly code, improved provenance definition for all elements

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update minimal smoldocling example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix repo id

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleaned up unnecessary logging

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* More elegant solution in removing the input prompt

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed minimal_smol_docling example from CI checks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Removed special html code wrapping when exporting to docling document, cleaned up comments

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Addressing PR comments, added enabled property to SmolDocling, and related VLM pipeline option, few other minor things

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Moved keep_backend = True to vlm pipeline

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed pipeline_options.generate_table_images from vlm_pipeline (deprecated in the pipelines)

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added example on how to get original predicted doctags in minimal_smol_docling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removing changes from base_pipeline

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Replaced remaining strings with appropriate enums

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated poetry.lock

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* re-built poetry.lock

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Generalize and refactor VLM pipeline and models

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename example

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Expose control over using flash_attention_2

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix VLM example exclusion in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back device_map and accelerate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make drawing code resilient against bad bboxes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: clean up code and comments

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: more cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: fix leftover .to(device)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: add proper table provenance

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-02-26 14:43:26 +01:00
Panos Vagenas
ab683e4fb6
feat(cli): add option for downloading all models, refine help messages (#1061)
* chore(cli): update download help messages

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* add `--all` flag to model download CLI

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-02-26 13:27:29 +01:00
Michele Dolfi
e197225739
fix: vlm using artifacts path (#1057)
* fix usage of artifacts path

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add granite vision to the download utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-26 08:33:50 +01:00
Panos Vagenas
c84b973959
docs: extend chunking docs, add FAQ on token limit (#1053)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-02-25 13:07:38 +01:00
Cesar Berrospi Ramis
1b0ead6907
fix(html): Parse text in div elements as TextItem (#1041)
feat(html): Parse text in div elements as TextItem

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-24 12:38:29 +01:00
Suehtam
1d17e7397a
test: avoid testing exact JSON in CSV backend (#1038)
* feat: updated verify_export
Moved verify_export to verify_utils
Reuse verify_export in tests

Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>

* feat: replace verify_export with verify_document in CSV conversion tests

Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>

---------

Signed-off-by: Matheus Abdias <matheusfabdias@gmail.com>
2025-02-24 08:10:40 +01:00
github-actions[bot]
d8a81c3168 chore: bump version to 2.24.0 [skip ci] 2025-02-20 18:31:20 +00:00
Christoph Auer
c93e36988f
feat: Implement new reading-order model (#916)
* Implement new reading-order model, replacing DS GLM model (WIP)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update reading-order model branch

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile [skip ci]

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add captions, footnotes and merges [skip ci]

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updates for reading-order implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updates for reading-order implementation

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes, update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add normalization, update tests again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests with code

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Push final lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* sanitize text

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Include furniture, update tests with furniture

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix content_layer assignment

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Delete empty file docling/models/ds_glm_model.py

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2025-02-20 17:51:17 +01:00
github-actions[bot]
c031a7ae47 chore: bump version to 2.23.1 [skip ci] 2025-02-20 16:26:41 +00:00
Cesar Berrospi Ramis
1ac010354f
test: avoid testing exact JSON (#1027)
* test: avoid testing exact JSON

Avoid testing exact JSON output in html and xml backends.
Reuse the JSON verify helper function among backend test files.
Improve type annotations in html backend.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* Update tests/test_backend_patent_uspto.py

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-02-20 16:20:07 +01:00
fanszoro
6796f0a132
fix: Runtime error when Pandas Series is not always of string type (#1024)
Signed-off-by: fan <fansluck@qq.com>
2025-02-20 15:41:41 +01:00
Christoph Auer
dfcc30dddb
chore: Update tests and lockfile (#1021)
Update tests and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-19 16:51:53 +01:00
Panos Vagenas
27c04007bc
docs: revamp picture description example (#1015)
* docs: revamp picture description example

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* Improvements for visualization example (#1017)

* fix colab install, use granite and improve viz of description

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* switch docs to notebook

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* show results with all models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* show other vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-19 11:28:54 +01:00
Cesar Berrospi Ramis
7450050ace
refactor: upgrade BeautifulSoup4 with type hints (#999)
* refactor: upgrade BeautifulSoup4 with type hints

Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* build: allow beautifulsoup4 version 4.12.3

Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-18 11:30:47 +01:00
github-actions[bot]
75db61127c chore: bump version to 2.23.0 [skip ci] 2025-02-17 14:22:49 +00:00
Maxim Lysak
6e75f0b5d3
fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)
* Testing fix for docling-core dt

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* fix: Fix code_formula test unit, update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix code-formula model for new docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Update fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases for office formats

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update deps and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 14:11:55 +01:00
Ahmed Nassar
77eb77bdc2
feat: Support cuda:n GPU device allocation (#694)
* Adding multi-gpu support, and cuda device allocation

Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixes pydantic exception with cuda:n
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Pydantic field validator and comment restored.

Signed-off-by: ahn <ahn@zurich.ibm.com>

* chore: Accept AcceleratorDevice enum type

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reset some options to default, removed EasyOCR model wrap.
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Fixed rebase issues
Signed-off-by: ahn <ahn@zurich.ibm.com>

* Revert accelerator test options
Signed-off-by: ahn <ahn@zurich.ibm.com>

---------

Signed-off-by: ahn <ahn@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: ahn <ahn@sonny.zuvela.ibm.com>
Co-authored-by: ahn <ahn@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 11:31:13 +01:00
Cesar Berrospi Ramis
428b656793
feat(xml-jats): parse XML JATS documents (#967)
* chore(xml-jats): separate authors and affiliations

In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(xml-jats): replace new line character by a space

Instead of removing new line character from text, replace it by a space character.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xml-jats): improve existing parser and extend features

Partially support lists, respect reading order, parse more sections, support equations, better text formatting.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore(xml-jats): rename PubMed objects to JATS

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-17 10:43:31 +01:00
Michele Dolfi
e1436a8b05
test: validate actual docitems in tests (#966)
* validate actual docitems in tests

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove verbose print

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* disable test generation

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-02-14 17:47:53 +01:00
github-actions[bot]
ffbde1d1b0 chore: bump version to 2.22.0 [skip ci] 2025-02-14 08:53:20 +00:00