feat: Support AsciiDoc and Markdown input format (#168)

* updated the base-model and added the asciidoc_backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the asciidoc backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Ensure all models work only on valid pages (#158)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* ci: run ci also on forks (#160)


---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* fix: fix legacy doc ref (#162)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* docs: typo fix (#155)

* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: add coverage_threshold to skip OCR for small images (#161)

* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore: bump version to 2.1.0 [skip ci]

* adding tests for asciidocs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working asciidoc parser

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding test_02.asciidoc

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Drafting Markdown backend via Marko library

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* work in progress on MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* md_backend produces docling document with headers, paragraphs, lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improvements in md parsing

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Detecting and assembling tables in markdown in temporary buffers

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added initial docling table support to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned code, improved logging for MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes MyPy requirements, and rest of pre-commit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed example run_md, added origin info to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* working on asciidocs, struggling with ImageRef

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* able to parse the captions and image URIs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update all backends with proper filename in DocumentOrigin

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update to docling-core v2.1.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for MD Backend, to avoid duplicated text inserts into docling doc

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix styling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added support for code blocks and fenced code in MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned prints

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added proper processing of in-line textual elements for MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issues with duplicated paragraphs and incorrect lists in pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issue with group ordering in pptx backend, added debug log into run with formats

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
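
Note on "Detecting and assembling tables in markdown in temporary buffers": the commit message does not show how the buffering works, so the following is only a minimal sketch of the general idea, not the code in md_backend. It assumes pipe-delimited table rows; the helper names `split_row`, `is_separator` and `collect_tables` are hypothetical. Consecutive lines containing `|` are held in a buffer, and the buffer is turned into a table once a non-table line ends the run.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(cells: list[str]) -> bool:
    """True for the header separator row, e.g. |---|:--:|."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def collect_tables(lines: list[str]) -> list[list[list[str]]]:
    """Group runs of table-like lines into tables (lists of cell rows)."""
    tables = []
    buffer = []  # temporary buffer for the table currently being read

    # The trailing "" acts as a sentinel so a table at the very end is flushed.
    for line in lines + [""]:
        if "|" in line:
            buffer.append(split_row(line))
            continue
        # A non-table line ends the run; keep it only if it has 2+ rows.
        if len(buffer) >= 2:
            tables.append([row for row in buffer if not is_separator(row)])
        buffer = []

    return tables
```

For example, `collect_tables(["| a | b |", "|---|---|", "| 1 | 2 |"])` yields one table with the rows `["a", "b"]` and `["1", "2"]`.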
This commit is contained in:
Christoph Auer
2024-10-23 16:14:26 +02:00
committed by GitHub
parent 3496b4838f
commit 3023f18ba0
52 changed files with 3731 additions and 3517 deletions
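
Note on the pptx change in the diff that follows: it appears to accumulate the text of all runs in a paragraph and emit one text or list item per paragraph, creating the list group only when the first list item appears. The sketch below illustrates that pattern in isolation, under simplifying assumptions. `Run`, `Paragraph`, `Output` and `emit_paragraphs` are hypothetical stand-ins and do not use the python-pptx or docling APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical stand-in for one text run inside a paragraph."""
    text: str


@dataclass
class Paragraph:
    """Hypothetical stand-in for a slide paragraph."""
    runs: list[Run]
    is_list_item: bool = False
    is_numbered: bool = False


@dataclass
class Output:
    """Collects what a real backend would add to the document."""
    texts: list[str] = field(default_factory=list)
    list_items: list[tuple[str, str]] = field(default_factory=list)  # (marker, text)
    groups_created: int = 0


def emit_paragraphs(paragraphs: list[Paragraph]) -> Output:
    out = Output()
    list_group_created = False
    item_number = 0

    for para in paragraphs:
        # Join all non-empty runs first, so one paragraph yields one item
        # instead of one item per run (the source of duplicated text).
        joined = "".join(r.text for r in para.runs if r.text.strip())
        if not joined:
            continue

        if para.is_list_item:
            item_number += 1
            if not list_group_created:
                # Create the list group lazily, on the first list item.
                out.groups_created += 1
                list_group_created = True
            marker = f"{item_number}." if para.is_numbered else ""
            out.list_items.append((marker, joined))
        else:
            # Plain text resets the numbering (an assumption of this sketch).
            item_number = 0
            out.texts.append(joined)

    return out
```

Joining runs before emitting is the design choice that matters here: a paragraph split into several runs still produces a single entry.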

@@ -83,21 +83,14 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
# Parses the PPTX into a structured document model.
# origin = DocumentOrigin(filename=self.path_or_stream.name, mimetype=next(iter(FormatToMimeType.get(InputFormat.PPTX))), binary_hash=self.document_hash)
fname = ""
if isinstance(self.path_or_stream, Path):
fname = self.path_or_stream.name
origin = DocumentOrigin(
filename=fname,
filename=self.file.name or "file",
mimetype="application/vnd.ms-powerpoint",
binary_hash=self.document_hash,
)
if len(fname) > 0:
docname = Path(fname).stem
else:
docname = "stream"
doc = DoclingDocument(
name=docname, origin=origin
name=self.file.stem or "file", origin=origin
) # must add origin information
doc = self.walk_linear(self.pptx_obj, doc)
@@ -119,10 +112,16 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
def handle_text_elements(self, shape, parent_slide, slide_ind, doc):
is_a_list = False
is_list_group_created = False
enum_list_item_value = 0
new_list = None
bullet_type = "None"
list_text = ""
list_label = GroupLabel.LIST
prov = self.generate_prov(shape, slide_ind, shape.text.strip())
# Identify if shape contains lists
for paragraph in shape.text_frame.paragraphs:
enum_list_item_value += 1
bullet_type = "None"
# Check if paragraph is a bullet point using the `element` XML
p = paragraph._element
if (
@@ -143,29 +142,32 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
if paragraph.level > 0:
# Most likely a sub-list
is_a_list = True
list_text = paragraph.text.strip()
prov = self.generate_prov(shape, slide_ind, shape.text.strip())
if is_a_list:
# Determine if this is an unordered list or an ordered list.
# Set GroupLabel.ORDERED_LIST when it fits.
list_label = GroupLabel.LIST
if bullet_type == "Numbered":
list_label = GroupLabel.ORDERED_LIST
new_list = doc.add_group(
label=list_label, name=f"list", parent=parent_slide
)
else:
new_list = None
if is_a_list:
_log.debug("LIST DETECTED!")
else:
_log.debug("No List")
# for e in p.iter():
# If there is a list inside of the shape, create a new docling list to assign list items to
# if is_a_list:
# new_list = doc.add_group(
# label=list_label, name=f"list", parent=parent_slide
# )
# Iterate through paragraphs to build up text
for paragraph in shape.text_frame.paragraphs:
# p_text = paragraph.text.strip()
p = paragraph._element
enum_list_item_value += 1
inline_paragraph_text = ""
inline_list_item_text = ""
for e in p.iterfind(".//a:r", namespaces={"a": self.namespaces["a"]}):
if len(e.text.strip()) > 0:
e_is_a_list_item = False
@@ -187,15 +189,17 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
e_is_a_list_item = False
if e_is_a_list_item:
if len(inline_paragraph_text) > 0:
# output accumulated inline text:
doc.add_text(
label=doc_label,
parent=parent_slide,
text=inline_paragraph_text,
prov=prov,
)
# Set marker and enumerated arguments if this is an enumeration element.
enum_marker = str(enum_list_item_value) + "."
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_list,
text=list_text,
prov=prov,
)
inline_list_item_text += e.text
# print(e.text)
else:
# Assign proper label to the text, depending if it's a Title or Section Header
# For other types of text, assign - PARAGRAPH
@@ -210,15 +214,34 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
doc_label = DocItemLabel.TITLE
elif placeholder_type == PP_PLACEHOLDER.SUBTITLE:
doc_label = DocItemLabel.SECTION_HEADER
enum_list_item_value = 0
inline_paragraph_text += e.text
doc.add_text(
label=doc_label,
parent=parent_slide,
text=list_text,
prov=prov,
)
if len(inline_paragraph_text) > 0:
# output accumulated inline text:
doc.add_text(
label=doc_label,
parent=parent_slide,
text=inline_paragraph_text,
prov=prov,
)
if len(inline_list_item_text) > 0:
enum_marker = ""
if is_numbered:
enum_marker = str(enum_list_item_value) + "."
if not is_list_group_created:
new_list = doc.add_group(
label=list_label, name=f"list", parent=parent_slide
)
is_list_group_created = True
doc.add_list_item(
marker=enum_marker,
enumerated=is_numbered,
parent=new_list,
text=inline_list_item_text,
prov=prov,
)
return
def handle_title(self, shape, parent_slide, slide_ind, doc):
@@ -311,7 +334,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
if len(tcells) > 0:
# If table is not fully empty...
# Create Docling table
doc.add_table(data=data, prov=prov)
doc.add_table(parent=parent_slide, data=data, prov=prov)
return
def walk_linear(self, pptx_obj, doc) -> DoclingDocument: