* Add DoclingParseV3 backend implementation
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use docling-core with docling-parse types
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes and test updates
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix streams
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reset tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* update test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* update unit tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add back DoclingParse v1 backend, pipeline options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update locks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Ground-truth files updated
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests, use TextCell.from_ocr property
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Text fixes, new test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Rename docling backend to v4
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Test all backends, fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reset all tests to use docling-parse v1 for now
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for DPv4 backend init, better test coverage
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* test_input_doc use default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): separate authors and affiliations
In the XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered in PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(xml-jats): replace newline character with a space
Instead of removing newline characters from text, replace each one with a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, and improve text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* added HTTP header support for document converter and CLI
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* fixed formatting and typing issues
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* use pydantic to parse dict
suggested by @dolfim-ibm
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
---------
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent and Trademark Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: change the name of the USPTO input format
Rename the USPTO patent input format to reflect its typical file format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: handle several input formats with the same MIME type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade to ds-glm 1.0 and docling-parse 3.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix DP2 backend code, change CLI default backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: added Excel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first msexcel backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tooling for the CLI
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working version for excel parsing of tables
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added proper typing for mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactor EXCEL to XLSX
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the unit tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* ran poetry lock
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding images to output [WIP]
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the msexcel (2)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added tests for merged cells in Excel
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Add settings to turn visualization on or off
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add profiling code to all models
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Refactor and fix profiling code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Visualization code outputs PNG to debug dir
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for time logging
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Optimize imports
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add start_timestamps to ProfilingItem
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add unit test for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
- fields `output` & `assembled` need not be optional
- introduced "synonym" `ConversionResult` for `ConvertedDocument` & deprecated the latter
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* Add safety checks for pages that fail to parse
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bump to docling-parse 1.1.1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce page-level error checks
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use docling-parse page-by-page
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Propagate document_hash to PDF backends, use docling-parse 1.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Upgrade lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* repin after more packages became available on PyPI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* Add assemble options and example saving pages and figures
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add options for different page elements, improve example, and flip name of assemble_options
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>