feat(xml-jats): parse XML JATS documents (#967)
* chore(xml-jats): separate authors and affiliations

In the XML PubMed (JATS) backend, convert authors and affiliations as they are typically rendered in PDFs.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(xml-jats): replace new line character by a space

Instead of removing newline characters from the text, replace each one with a space character.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* feat(xml-jats): improve existing parser and extend features

Partially support lists, respect reading order, parse more sections, support equations, and improve text formatting.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore(xml-jats): rename PubMed objects to JATS

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Committed by: GitHub
Parent: e1436a8b05
Commit: 428b656793
@@ -333,11 +333,11 @@ class _DocumentConversionInput(BaseModel):
                 ):
                     input_format = InputFormat.XML_USPTO
 
-                if (
-                    InputFormat.XML_PUBMED in formats
-                    and "/NLM//DTD JATS" in xml_doctype
+                if InputFormat.XML_JATS in formats and (
+                    "JATS-journalpublishing" in xml_doctype
+                    or "JATS-archive" in xml_doctype
                 ):
-                    input_format = InputFormat.XML_PUBMED
+                    input_format = InputFormat.XML_JATS
 
             elif mime == "text/plain":
                 if InputFormat.XML_USPTO in formats and content_str.startswith("PATN\r\n"):
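The new condition in this hunk can be tried in isolation with the sketch below. It is only an illustration, not the project's code: the `InputFormat` enum here is a stand-in with assumed member values, and `guess_jats_format` is a hypothetical helper that wraps the same substring check on the DOCTYPE declaration.

```python
from enum import Enum
from typing import Optional, Set


class InputFormat(str, Enum):
    # Stand-in for the project's enum; the string values are assumptions.
    XML_USPTO = "xml_uspto"
    XML_JATS = "xml_jats"


def guess_jats_format(
    xml_doctype: str, formats: Set[InputFormat]
) -> Optional[InputFormat]:
    """Hypothetical helper mirroring the condition added in the hunk."""
    # JATS is selected only if the caller allows it AND the DOCTYPE names
    # either the journal publishing or the archiving DTD.
    if InputFormat.XML_JATS in formats and (
        "JATS-journalpublishing" in xml_doctype
        or "JATS-archive" in xml_doctype
    ):
        return InputFormat.XML_JATS
    return None


# Sample DOCTYPE declarations used only for illustration.
publishing = (
    '<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing '
    'DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd">'
)
archiving = (
    '<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving '
    'and Interchange DTD v1.2 20190208//EN" "JATS-archivearticle1.dtd">'
)

allowed = {InputFormat.XML_JATS, InputFormat.XML_USPTO}
print(guess_jats_format(publishing, allowed))
print(guess_jats_format(archiving, allowed))
print(guess_jats_format("<!DOCTYPE html>", allowed))
```

Compared with the removed check, which matched any DOCTYPE containing "/NLM//DTD JATS", the new check keys on the DTD file name, so it accepts only the publishing and archiving variants.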