fix: update docling-core to 2.22.0
Update dependency library docling-core to latest release 2.22.0
Fix regression tests and ground truth files
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: upgrade BeautifulSoup4 with type hints
Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* build: allow beautifulsoup4 version 4.12.3
Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): separate authors and affiliations
In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix(xml-jats): replace new line character by a space
Instead of removing new line character from text, replace it by a space character.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat(xml-jats): improve existing parser and extend features
Partially support lists, respect reading order, parse more sections, support equations, better text formatting.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore(xml-jats): rename PubMed objects to JATS
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: change the name of the USPTO input format
Change the name of the patent USPTO input format to show the typical format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: address several input formats with same mime type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>