fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295)
* Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Identify headers through inherited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Identify headers through inherited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix(docx): Improve text parsing (#1268)
* chore: bump version to 2.28.4 [skip ci] Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Improve text parsing Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) fix wrong type text extracted by tesseract_ocr_cli_model Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Flexibilize heading detection Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Fix trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Remove trailing space Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
--------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* docs: add visual grounding example (#1270) Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* feat(docx): add text formatting and hyperlink support (#630)
* feat: Enable markdown text formatting for docx Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix imports Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use Formatting Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle hyperlink Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle formatting properly for DocItemLabel.PARAGRAPH Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline group Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle bullet lists Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run black and mypy Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle header and footer Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline_fmt everywhere Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run precommit Signed-off-by: SimJeg <sjegou@nvidia.com>
* Address feedback Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix add_list_item Signed-off-by: SimJeg <sjegou@nvidia.com>
* fix minor bugs, mark helper methods internal Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
--------- Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* fix(pptx): check if picture shape has an image attached (#1316) Check if picture shape has an image attached in pptx backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* chore: update lock file (#1315) chore: update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* docs: add plugins docs (#1319) add plugin docs Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* feat: handle <code> tags as code blocks (#1320) handle <code> tags as code blocks Signed-off-by: FernandoSSI <fernandosi2005@gmail.com> Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Identify headers through inherited style Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Log warning message instead of print Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
* Adding new latex symbols, simplifying how equations are added to text Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
--------- Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com> Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: SimJeg <sjegou@nvidia.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: FernandoSSI <fernandosi2005@gmail.com> Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com> Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com> Co-authored-by: Panos Vagenas <pva@zurich.ibm.com> Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com>
parent 0499cd1c1e
commit 14e9c0ce9a
@@ -215,6 +215,9 @@ FUNC = {
     "coth": "\\coth({fe})",
     "sec": "\\sec({fe})",
     "csc": "\\csc({fe})",
+    "mod": "\\mod {fe}",
+    "max": "\\max({fe})",
+    "min": "\\min({fe})",
 }

 FUNC_PLACE = "{fe}"
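A note on how these templates are consumed: each `FUNC` value carries the `{fe}` placeholder (`FUNC_PLACE`) where the function argument ends up. The snippet below is a minimal sketch of that substitution, not the converter's actual code; `render_func` is a hypothetical helper introduced only for illustration, and only the three entries added in this hunk are copied.

```python
# Placeholder and the three entries added by this hunk (copied from the diff).
FUNC_PLACE = "{fe}"
FUNC = {
    "mod": "\\mod {fe}",
    "max": "\\max({fe})",
    "min": "\\min({fe})",
}


def render_func(name: str, argument: str) -> str:
    """Hypothetical helper: fill the {fe} placeholder of a FUNC template."""
    template = FUNC[name]
    # Plain string replacement, so braces inside the argument are left alone.
    return template.replace(FUNC_PLACE, argument)


print(render_func("max", "a, b"))
print(render_func("mod", "n"))
```

Plain `str.replace` is used here instead of `str.format` so that LaTeX braces in the argument cannot be misread as format fields.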
@@ -5,6 +5,8 @@ Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
 On 23/01/2025
 """

+import logging
+
 import lxml.etree as ET
 from pylatexenc.latexencode import UnicodeToLatexEncoder

@@ -39,6 +41,8 @@ from docling.backend.docx.latex.latex_dict import (

 OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"

+_log = logging.getLogger(__name__)
+

 def load(stream):
     tree = ET.parse(stream)
@@ -281,8 +285,10 @@ class oMath2Latex(Tag2Method):
                 if FUNC.get(t):
                     latex_chars.append(FUNC[t])
                 else:
-                    raise NotSupport("Not support func %s" % t)
-            else:
+                    _log.warning("Function not supported, will default to text: %s", t)
+                    if isinstance(t, str):
+                        latex_chars.append(t)
+            elif isinstance(t, str):
                 latex_chars.append(t)
         t = BLANK.join(latex_chars)
         return t if FUNC_PLACE in t else t + FUNC_PLACE  # do_func will replace this
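The behavioural change in this hunk is that an unknown function name no longer raises `NotSupport`; it is logged and emitted as plain text, with the placeholder appended so the argument can still be substituted later. A minimal sketch of that fallback, assuming a two-entry `FUNC` table and a hypothetical `function_template` helper (neither is the converter's real structure):

```python
import logging

_log = logging.getLogger(__name__)

FUNC_PLACE = "{fe}"
# Assumption: a reduced table; the real FUNC dict has many more entries.
FUNC = {"sec": "\\sec({fe})", "max": "\\max({fe})"}


def function_template(name: str) -> str:
    """Hypothetical helper mirroring the fallback: warn and keep the raw name."""
    template = FUNC.get(name)
    if template is None:
        _log.warning("Function not supported, will default to text: %s", name)
        template = name
    # Same final step as the diff: make sure the placeholder is present.
    return template if FUNC_PLACE in template else template + FUNC_PLACE


print(function_template("max"))
print(function_template("erf"))
```

Logging instead of raising means one unsupported function degrades a single equation rather than aborting the conversion.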
@@ -382,8 +388,6 @@ class oMath2Latex(Tag2Method):

         out_latex_str = self.u.unicode_to_latex(s)

-        # print(s, out_latex_str)
-
         if (
             s.startswith("{") is False
             and out_latex_str.startswith("{")
@@ -392,19 +396,13 @@ class oMath2Latex(Tag2Method):
         ):
             out_latex_str = f" {out_latex_str[1:-1]} "

-        # print(s, out_latex_str)
-
         if "ensuremath" in out_latex_str:
             out_latex_str = out_latex_str.replace("\\ensuremath{", " ")
             out_latex_str = out_latex_str.replace("}", " ")

-        # print(s, out_latex_str)
-
         if out_latex_str.strip().startswith("\\text"):
             out_latex_str = f" \\text{{{out_latex_str}}} "

-        # print(s, out_latex_str)
-
         return out_latex_str

     def do_r(self, elm):
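This hunk only drops the debug prints, but the surviving `ensuremath` branch is worth a quick illustration. The sketch below isolates that one step in a hypothetical `strip_ensuremath` helper; it takes an already-encoded string rather than calling pylatexenc, so the input shown is an assumed example of encoder output.

```python
def strip_ensuremath(latex: str) -> str:
    """Hypothetical helper: remove an \\ensuremath{...} wrapper as the diff does."""
    if "ensuremath" in latex:
        latex = latex.replace("\\ensuremath{", " ")
        # Note: this replaces every closing brace, not only the wrapper's.
        latex = latex.replace("}", " ")
    return latex


print(repr(strip_ensuremath("\\ensuremath{\\alpha}")))
print(repr(strip_ensuremath("x")))
```

Replacing every `}` is safe only for short inputs; in this file `do_r` calls `process_unicode` once per character, which keeps the inputs small.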
@@ -415,7 +413,9 @@ class oMath2Latex(Tag2Method):
         """
         _str = []
         _base_str = []
-        for s in elm.findtext("./{0}t".format(OMML_NS)):
+        found_text = elm.findtext("./{0}t".format(OMML_NS))
+        if found_text:
+            for s in found_text:
                 out_latex_str = self.process_unicode(s)
                 _str.append(out_latex_str)
                 _base_str.append(s)
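The guard added here matters because `findtext` returns `None` when the run has no text child, and iterating over `None` raises `TypeError`. A minimal sketch of the same guard follows; it uses the standard library's `xml.etree.ElementTree` in place of lxml (an assumption made to keep the example dependency-free), and `run_characters` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


def run_characters(run: ET.Element) -> list:
    """Hypothetical helper: characters of the run's <m:t> child, or [] if absent."""
    found_text = run.findtext("./{0}t".format(OMML_NS))
    # findtext returns None when no <m:t> exists, so guard before iterating.
    return list(found_text) if found_text else []


ns_uri = OMML_NS[1:-1]  # strip the surrounding braces to get the bare URI
with_text = ET.fromstring(f'<m:r xmlns:m="{ns_uri}"><m:t>x+1</m:t></m:r>')
without_text = ET.fromstring(f'<m:r xmlns:m="{ns_uri}"/>')

print(run_characters(with_text))
print(run_characters(without_text))
```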
@@ -58,6 +58,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         self.level_at_new_list: Optional[int] = None
         self.parents: dict[int, Optional[NodeItem]] = {}
         self.numbered_headers: dict[int, int] = {}
+        self.equation_bookends: str = "<eq>{EQ}</eq>"
         for i in range(-1, self.max_levels):
             self.parents[i] = None

@@ -263,6 +264,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):

         label = paragraph.style.style_id
         name = paragraph.style.name
+        base_style_label = None
+        base_style_name = None
+        if base_style := getattr(paragraph.style, "base_style", None):
+            base_style_label = base_style.style_id
+            base_style_name = base_style.name

         if label is None:
             return "Normal", None
@@ -276,6 +282,10 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             return self._get_heading_and_level(label)
         if "heading" in name.lower():
             return self._get_heading_and_level(name)
+        if base_style_label and "heading" in base_style_label.lower():
+            return self._get_heading_and_level(base_style_label)
+        if base_style_name and "heading" in base_style_name.lower():
+            return self._get_heading_and_level(base_style_name)

         return label, None

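Taken together, the two hunks above let a custom style be recognised as a heading when the style it is based on is one. The sketch below shows that lookup order under stated assumptions: `Style` is a stand-in dataclass rather than the python-docx class, `heading_candidate` is a hypothetical helper, and the backend's earlier label checks (not visible in this diff) are folded into one loop.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Style:
    """Stand-in for a paragraph style; field names follow the diff."""
    style_id: Optional[str]
    name: str
    base_style: Optional["Style"] = None


def heading_candidate(style: Style) -> Optional[str]:
    """Hypothetical helper: first identifier that mentions 'heading', if any."""
    candidates = [style.style_id, style.name]
    # Only one level of inheritance is consulted, as in the diff.
    if base_style := getattr(style, "base_style", None):
        candidates += [base_style.style_id, base_style.name]
    for candidate in candidates:
        if candidate and "heading" in candidate.lower():
            return candidate
    return None


custom = Style("MyTitle", "My Title", base_style=Style("Heading2", "heading 2"))
plain = Style("Normal", "Normal")
print(heading_candidate(custom))
print(heading_candidate(plain))
```

A style derived from a heading through two or more intermediate styles would still be missed, since only the direct base style is inspected.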
@@ -356,9 +366,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                     only_texts.append(subt.text)
                     texts_and_equations.append(subt.text)
             elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
-                latex_equation = str(oMath2Latex(subt))
-                only_equations.append(latex_equation)
-                texts_and_equations.append(latex_equation)
+                latex_equation = str(oMath2Latex(subt)).strip()
+                if len(latex_equation) > 0:
+                    only_equations.append(
+                        self.equation_bookends.format(EQ=latex_equation)
+                    )
+                    texts_and_equations.append(
+                        self.equation_bookends.format(EQ=latex_equation)
+                    )

         if len(only_equations) < 1:
             return text, []
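Equations are now stripped, skipped when empty, and wrapped in the `<eq>…</eq>` bookends before being recorded. A minimal sketch of that wrapping step, with `wrap_equation` as a hypothetical helper:

```python
from typing import Optional

# Template copied from the diff.
EQUATION_BOOKENDS = "<eq>{EQ}</eq>"


def wrap_equation(latex: str) -> Optional[str]:
    """Hypothetical helper: bookend a LaTeX string, or None if it is blank."""
    latex = latex.strip()
    if len(latex) == 0:
        return None
    # str.format parses only the template, so braces in `latex` are safe.
    return EQUATION_BOOKENDS.format(EQ=latex)


print(wrap_equation(" \\frac{a}{b} "))
print(wrap_equation("   "))
```

The bookends give each equation a marker that cannot be confused with ordinary paragraph text, which the later splitting step relies on.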
@@ -373,21 +388,20 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):

         # Insert equations into original text
         # This is done to preserve white space structure
-        output_text = ""
+        output_text = text[:]
         init_i = 0
         for i_substr, substr in enumerate(texts_and_equations):
-            if substr not in text:
+            if len(substr) == 0:
+                continue
+
+            if substr in output_text[init_i:]:
+                init_i += output_text[init_i:].find(substr) + len(substr)
+            else:
                 if i_substr > 0:
-                    i_text_before = text[init_i:].find(
-                        texts_and_equations[i_substr - 1]
-                    )
-                    output_text += text[init_i:][
-                        : i_text_before + len(texts_and_equations[i_substr - 1])
-                    ]
-                    init_i += i_text_before + len(texts_and_equations[i_substr - 1])
-                output_text += substr
-                if only_equations.index(substr) == len(only_equations) - 1:
-                    output_text += text[init_i:]
+                    output_text = output_text[:init_i] + substr + output_text[init_i:]
+                    init_i += len(substr)
+                else:
+                    output_text = substr + output_text

         return output_text, only_equations

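The rewritten loop starts from the paragraph text and walks a cursor (`init_i`) through it: pieces already present advance the cursor, missing pieces (the equations) are spliced in at the cursor. The sketch below lifts that loop into a standalone function; `insert_equations` and the sample strings are assumptions for illustration, the loop body follows the diff.

```python
def insert_equations(text: str, texts_and_equations: list) -> str:
    """Standalone copy of the new splice loop (function name is hypothetical)."""
    output_text = text[:]
    init_i = 0
    for i_substr, substr in enumerate(texts_and_equations):
        if len(substr) == 0:
            continue
        if substr in output_text[init_i:]:
            # Piece already present: move the cursor just past it.
            init_i += output_text[init_i:].find(substr) + len(substr)
        else:
            if i_substr > 0:
                # Missing piece (an equation): splice it in at the cursor.
                output_text = output_text[:init_i] + substr + output_text[init_i:]
                init_i += len(substr)
            else:
                # Equation is the very first piece: prepend it.
                output_text = substr + output_text
    return output_text


# Assumed input: paragraph text with the equation absent, whitespace intact.
paragraph = "Let  hold for all x."
pieces = ["Let ", "<eq>a^2+b^2=c^2</eq>", " hold for all x."]
print(insert_equations(paragraph, pieces))
```

Because the original text is the starting point, whitespace around the equation is kept exactly as it appeared in the paragraph.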
@@ -479,13 +493,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             self._add_header(doc, p_level, text, is_numbered_style)

         elif len(equations) > 0:
-            if (raw_text is None or len(raw_text) == 0) and len(text) > 0:
+            if (raw_text is None or len(raw_text.strip()) == 0) and len(text) > 0:
                 # Standalone equation
                 level = self._get_level()
                 doc.add_text(
                     label=DocItemLabel.FORMULA,
                     parent=self.parents[level - 1],
-                    text=text,
+                    text=text.replace("<eq>", "").replace("</eq>", ""),
                 )
             else:
                 # Inline equation
@@ -498,8 +512,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                     if len(text_tmp) == 0:
                         break

-                    pre_eq_text = text_tmp.split(eq.strip(), maxsplit=1)[0]
-                    text_tmp = text_tmp.split(eq.strip(), maxsplit=1)[1]
+                    split_text_tmp = text_tmp.split(eq.strip(), maxsplit=1)
+
+                    pre_eq_text = split_text_tmp[0]
+                    text_tmp = "" if len(split_text_tmp) == 1 else split_text_tmp[1]
+
                     if len(pre_eq_text) > 0:
                         doc.add_text(
                             label=DocItemLabel.PARAGRAPH,
@@ -509,8 +526,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                     doc.add_text(
                         label=DocItemLabel.FORMULA,
                         parent=inline_equation,
-                        text=eq,
+                        text=eq.replace("<eq>", "").replace("</eq>", ""),
                     )
+
                 if len(text_tmp) > 0:
                     doc.add_text(
                         label=DocItemLabel.PARAGRAPH,
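The last two hunks split the bookended text around each equation and strip the bookends before the formula is stored; the single `split` call also removes the `IndexError` the old `[1]` lookup hit when an equation was not found in the remaining text. A minimal sketch under stated assumptions: `split_inline` is a hypothetical helper, and `(label, text)` tuples stand in for the `doc.add_text(...)` calls.

```python
def split_inline(text: str, equations: list) -> list:
    """Hypothetical helper: (label, text) pairs in place of doc.add_text calls."""
    items = []
    text_tmp = text
    for eq in equations:
        if len(text_tmp) == 0:
            break
        split_text_tmp = text_tmp.split(eq.strip(), maxsplit=1)
        pre_eq_text = split_text_tmp[0]
        # A single-element result means the equation was not found.
        text_tmp = "" if len(split_text_tmp) == 1 else split_text_tmp[1]
        if len(pre_eq_text) > 0:
            items.append(("paragraph", pre_eq_text))
        # Bookends are removed before the formula text is stored.
        items.append(("formula", eq.replace("<eq>", "").replace("</eq>", "")))
    if len(text_tmp) > 0:
        items.append(("paragraph", text_tmp))
    return items


print(split_inline("Let <eq>a^2</eq> hold", ["<eq>a^2</eq>"]))
```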