Docling/docs/supported_formats.md
Cesar Berrospi Ramis 428b656793
feat(xml-jats): parse XML JATS documents (#967)
2025-02-17 10:43:31 +01:00

Docling can parse various document formats into a unified representation (Docling Document), which it can then export to several output formats. See Architecture for more details.

Below you can find a listing of all supported input and output formats.

Supported input formats

| Format | Description |
|--------|-------------|
| PDF | |
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
| Markdown | |
| AsciiDoc | |
| HTML, XHTML | |
| CSV | |
| PNG, JPEG, TIFF, BMP | Image formats |

Schema-specific support:

| Format | Description |
|--------|-------------|
| USPTO XML | XML format followed by USPTO patents |
| JATS XML | XML format followed by JATS articles |
| Docling JSON | JSON-serialized Docling Document |
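
To illustrate the difference between the general formats and the schema-specific ones, here is a minimal sketch of a lookup from file extension to format name. It is not Docling's own detection logic: the helper `guess_input_format`, the extension map, and the `"schema-specific"` label are all hypothetical and only mirror the tables above. The point it shows is that an `.xml` or `.json` extension alone does not say whether a file is USPTO XML, JATS XML, or Docling JSON, so those files need to be inspected by content.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension map derived from the tables above.
# This is an illustration, not the mapping Docling uses internally.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".xlsx": "XLSX",
    ".pptx": "PPTX",
    ".md": "Markdown",
    ".adoc": "AsciiDoc",
    ".asciidoc": "AsciiDoc",
    ".html": "HTML",
    ".htm": "HTML",
    ".xhtml": "XHTML",
    ".csv": "CSV",
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".bmp": "BMP",
}

# Container extensions whose schema cannot be told from the name alone
# (USPTO XML vs. JATS XML, Docling JSON vs. arbitrary JSON).
SCHEMA_SPECIFIC_EXTENSIONS = {".xml", ".json"}


def guess_input_format(path: str) -> Optional[str]:
    """Return a format name from the tables, "schema-specific", or None."""
    suffix = Path(path).suffix.lower()
    if suffix in SCHEMA_SPECIFIC_EXTENSIONS:
        # The file content must be inspected to decide the actual schema.
        return "schema-specific"
    return EXTENSION_TO_FORMAT.get(suffix)
```

For example, `guess_input_format("scan.jpeg")` maps to the image row of the first table, while `guess_input_format("patent.xml")` only reports that a schema check is still needed.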

Supported output formats

| Format | Description |
|--------|-------------|
| HTML | Both image embedding and referencing are supported |
| Markdown | |
| JSON | Lossless serialization of Docling Document |
| Text | Plain text, i.e. without Markdown markers |
| Doctags | |
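
As a rough sketch of how the output formats above might be selected in code, the helper below dispatches a format name to an export method on a document object. The helper `export_document` and the `EXPORT_METHODS` table are hypothetical. The method names (`export_to_markdown`, `export_to_html`, `export_to_dict`, `export_to_text`) are the ones a Docling Document is assumed to expose; please verify them against the installed version. Doctags is left out because its export method name is not confirmed here.

```python
from typing import Any

# Assumed mapping from output format name to export method name.
# Check these names against the Docling version you have installed.
EXPORT_METHODS = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "json": "export_to_dict",
    "text": "export_to_text",
}


def export_document(doc: Any, fmt: str) -> Any:
    """Export `doc` in the requested format via its matching export method."""
    try:
        method_name = EXPORT_METHODS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported output format: {fmt!r}") from None

    method = getattr(doc, method_name, None)
    if method is None:
        # The object does not provide the assumed export method.
        raise AttributeError(f"document has no method {method_name!r}")
    return method()
```

With Docling installed, the document object would typically come from `DocumentConverter().convert(source).document`, using `DocumentConverter` from `docling.document_converter`. The helper itself only relies on the method names listed in the table, so it works with any object that provides them.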