fix(markdown): make parsing of rich table cells valid (#1821)
* fix: update md table classification

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix ground truth header changes

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix merge issues

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

* Fix minor ground truth errors

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>

---------

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
parent ee4781075a
commit e79e4f0ab6
@@ -335,7 +335,7 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
 _log.debug(f" - Paragraph (raw text): {element.children}")
 snippet_text = element.children.strip()
 # Detect start of the table:
-if "|" in snippet_text:
+if "|" in snippet_text or self.in_table:
     # most likely part of the markdown table
     self.in_table = True
     if len(self.md_table_buffer) > 0:
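The one-line change in the hunk above can be read as a small state rule: once a table has started, every following raw-text fragment is kept as table content, even when the fragment itself contains no pipe. That is presumably what lets formatted cells such as `**Bold Heading**` stay inside the table. The following is a minimal sketch of that rule only, not the backend's code: the `TableTracker` class, its `feed` method, and the `close` reset are hypothetical names introduced for illustration, and the diff does not show where the real backend clears `in_table`.

```python
class TableTracker:
    """Hypothetical stand-in for the backend's table state (illustration only)."""

    def __init__(self):
        self.in_table = False
        self.md_table_buffer = []

    def feed(self, raw_text):
        """Return True if the fragment is classified as table content."""
        snippet_text = raw_text.strip()
        # Same condition as the patched line: a pipe starts a table,
        # and an already-open table claims pipe-free fragments too.
        if "|" in snippet_text or self.in_table:
            self.in_table = True
            self.md_table_buffer.append(snippet_text)
            return True
        return False

    def close(self):
        """Assumed reset step: hand back the buffered fragments and leave the table."""
        rows, self.md_table_buffer = self.md_table_buffer, []
        self.in_table = False
        return rows


# Fragments as they might arrive for a header row with formatted cells.
tracker = TableTracker()
fragments = ["| ", "Bold Heading", " | ", "Italic Heading", " |"]
flags = [tracker.feed(fragment) for fragment in fragments]
```

With the old condition (`"|" in snippet_text` alone), the pipe-free fragments `Bold Heading` and `Italic Heading` would not have been classified as table content in this sketch.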
@@ -16,8 +16,17 @@ Create your feature branch: `git checkout -b feature/AmazingFeature` .
 
 # *Whole heading is italic*
 
+- **First** : Lorem ipsum.
+- **Second** : Dolor `sit` amet.
+
 Some *`formatted_code`*
 
 ## *Partially formatted* heading to\_escape `not_to_escape`
 
 [$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
+
+## Table Heading
+
+| Bold Heading | Italic Heading |
+|----------------|------------------|
+| data a | data b |
@@ -7,8 +7,12 @@ body:
 - $ref: '#/groups/2'
 - $ref: '#/texts/32'
 - $ref: '#/groups/8'
-- $ref: '#/texts/35'
-- $ref: '#/texts/39'
+- $ref: '#/groups/11'
+- $ref: '#/texts/43'
+- $ref: '#/texts/47'
+- $ref: '#/texts/48'
+- $ref: '#/groups/13'
+- $ref: '#/tables/0'
 content_layer: body
 label: unspecified
 name: _root_
@@ -109,33 +113,205 @@ groups:
|
|||||||
self_ref: '#/groups/7'
|
self_ref: '#/groups/7'
|
||||||
- children:
|
- children:
|
||||||
- $ref: '#/texts/33'
|
- $ref: '#/texts/33'
|
||||||
|
- $ref: '#/texts/36'
|
||||||
|
content_layer: body
|
||||||
|
label: list
|
||||||
|
name: list
|
||||||
|
parent:
|
||||||
|
$ref: '#/body'
|
||||||
|
self_ref: '#/groups/8'
|
||||||
|
- children:
|
||||||
- $ref: '#/texts/34'
|
- $ref: '#/texts/34'
|
||||||
|
- $ref: '#/texts/35'
|
||||||
|
content_layer: body
|
||||||
|
label: inline
|
||||||
|
name: group
|
||||||
|
parent:
|
||||||
|
$ref: '#/texts/33'
|
||||||
|
self_ref: '#/groups/9'
|
||||||
|
- children:
|
||||||
|
- $ref: '#/texts/37'
|
||||||
|
- $ref: '#/texts/38'
|
||||||
|
- $ref: '#/texts/39'
|
||||||
|
- $ref: '#/texts/40'
|
||||||
|
content_layer: body
|
||||||
|
label: inline
|
||||||
|
name: group
|
||||||
|
parent:
|
||||||
|
$ref: '#/texts/36'
|
||||||
|
self_ref: '#/groups/10'
|
||||||
|
- children:
|
||||||
|
- $ref: '#/texts/41'
|
||||||
|
- $ref: '#/texts/42'
|
||||||
content_layer: body
|
content_layer: body
|
||||||
label: inline
|
label: inline
|
||||||
name: group
|
name: group
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/body'
|
$ref: '#/body'
|
||||||
self_ref: '#/groups/8'
|
self_ref: '#/groups/11'
|
||||||
- children:
|
- children:
|
||||||
- $ref: '#/texts/36'
|
- $ref: '#/texts/44'
|
||||||
- $ref: '#/texts/37'
|
- $ref: '#/texts/45'
|
||||||
- $ref: '#/texts/38'
|
- $ref: '#/texts/46'
|
||||||
content_layer: body
|
content_layer: body
|
||||||
label: inline
|
label: inline
|
||||||
name: group
|
name: group
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/texts/35'
|
$ref: '#/texts/43'
|
||||||
self_ref: '#/groups/9'
|
self_ref: '#/groups/12'
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
label: inline
|
||||||
|
name: group
|
||||||
|
parent:
|
||||||
|
$ref: '#/body'
|
||||||
|
self_ref: '#/groups/13'
|
||||||
key_value_items: []
|
key_value_items: []
|
||||||
name: inline_and_formatting
|
name: inline_and_formatting
|
||||||
origin:
|
origin:
|
||||||
binary_hash: 16409076955457599155
|
binary_hash: 14550011543526094526
|
||||||
filename: inline_and_formatting.md
|
filename: inline_and_formatting.md
|
||||||
mimetype: text/markdown
|
mimetype: text/markdown
|
||||||
pages: {}
|
pages: {}
|
||||||
pictures: []
|
pictures: []
|
||||||
schema_name: DoclingDocument
|
schema_name: DoclingDocument
|
||||||
tables: []
|
tables:
|
||||||
|
- annotations: []
|
||||||
|
captions: []
|
||||||
|
children: []
|
||||||
|
content_layer: body
|
||||||
|
data:
|
||||||
|
grid:
|
||||||
|
- - col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Bold Heading
|
||||||
|
- col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Italic Heading
|
||||||
|
- - col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data a
|
||||||
|
- col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data b
|
||||||
|
num_cols: 2
|
||||||
|
num_rows: 2
|
||||||
|
table_cells:
|
||||||
|
- col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Bold Heading
|
||||||
|
- col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Italic Heading
|
||||||
|
- col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data a
|
||||||
|
- col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data b
|
||||||
|
- col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Bold Heading
|
||||||
|
- col_span: 1
|
||||||
|
column_header: true
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 1
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 0
|
||||||
|
text: Italic Heading
|
||||||
|
- col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 1
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 0
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data a
|
||||||
|
- col_span: 1
|
||||||
|
column_header: false
|
||||||
|
end_col_offset_idx: 2
|
||||||
|
end_row_offset_idx: 2
|
||||||
|
row_header: false
|
||||||
|
row_section: false
|
||||||
|
row_span: 1
|
||||||
|
start_col_offset_idx: 1
|
||||||
|
start_row_offset_idx: 1
|
||||||
|
text: data b
|
||||||
|
footnotes: []
|
||||||
|
label: table
|
||||||
|
parent:
|
||||||
|
$ref: '#/body'
|
||||||
|
prov: []
|
||||||
|
references: []
|
||||||
|
self_ref: '#/tables/0'
|
||||||
texts:
|
texts:
|
||||||
- children: []
|
- children: []
|
||||||
content_layer: body
|
content_layer: body
|
||||||
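The `tables` entry in the ground truth above stores each cell with half-open row and column offsets (`start_*_offset_idx` inclusive, `end_*_offset_idx` exclusive) and marks the first row as `column_header`. As a rough sketch of how a plain 2x2 table maps onto those fields, assuming no spanning cells, the hypothetical helper `build_grid` below (not a docling function) produces dictionaries with the same keys and values as the `grid` section:

```python
def build_grid(rows):
    """Build a grid of cell dicts from a list of rows of cell text.

    Hypothetical helper for illustration; assumes every cell spans
    exactly one row and one column and that row 0 is the header row.
    """
    grid = []
    for r, row in enumerate(rows):
        grid_row = []
        for c, text in enumerate(row):
            grid_row.append(
                {
                    "text": text,
                    "row_span": 1,
                    "col_span": 1,
                    # Offsets are half-open: end = start + span.
                    "start_row_offset_idx": r,
                    "end_row_offset_idx": r + 1,
                    "start_col_offset_idx": c,
                    "end_col_offset_idx": c + 1,
                    "column_header": r == 0,
                    "row_header": False,
                    "row_section": False,
                }
            )
        grid.append(grid_row)
    return grid


grid = build_grid([["Bold Heading", "Italic Heading"], ["data a", "data b"]])
num_rows = len(grid)
num_cols = len(grid[0])
```

Note that the cell text carries no bold or italic markers here, matching the ground truth, where the header cells appear as plain `Bold Heading` and `Italic Heading`.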
@@ -512,14 +688,108 @@ texts:
|
|||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/32'
|
self_ref: '#/texts/32'
|
||||||
text: Whole heading is italic
|
text: Whole heading is italic
|
||||||
|
- children:
|
||||||
|
- $ref: '#/groups/9'
|
||||||
|
content_layer: body
|
||||||
|
enumerated: false
|
||||||
|
label: list_item
|
||||||
|
marker: '-'
|
||||||
|
orig: ''
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/8'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/33'
|
||||||
|
text: ''
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
formatting:
|
||||||
|
bold: true
|
||||||
|
italic: false
|
||||||
|
script: baseline
|
||||||
|
strikethrough: false
|
||||||
|
underline: false
|
||||||
|
label: text
|
||||||
|
orig: First
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/9'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/34'
|
||||||
|
text: First
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
label: text
|
||||||
|
orig: ': Lorem ipsum.'
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/9'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/35'
|
||||||
|
text: ': Lorem ipsum.'
|
||||||
|
- children:
|
||||||
|
- $ref: '#/groups/10'
|
||||||
|
content_layer: body
|
||||||
|
enumerated: false
|
||||||
|
label: list_item
|
||||||
|
marker: '-'
|
||||||
|
orig: ''
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/8'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/36'
|
||||||
|
text: ''
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
formatting:
|
||||||
|
bold: true
|
||||||
|
italic: false
|
||||||
|
script: baseline
|
||||||
|
strikethrough: false
|
||||||
|
underline: false
|
||||||
|
label: text
|
||||||
|
orig: Second
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/10'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/37'
|
||||||
|
text: Second
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
label: text
|
||||||
|
orig: ': Dolor'
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/10'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/38'
|
||||||
|
text: ': Dolor'
|
||||||
|
- captions: []
|
||||||
|
children: []
|
||||||
|
code_language: unknown
|
||||||
|
content_layer: body
|
||||||
|
footnotes: []
|
||||||
|
label: code
|
||||||
|
orig: sit
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/10'
|
||||||
|
prov: []
|
||||||
|
references: []
|
||||||
|
self_ref: '#/texts/39'
|
||||||
|
text: sit
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
label: text
|
||||||
|
orig: amet.
|
||||||
|
parent:
|
||||||
|
$ref: '#/groups/10'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/40'
|
||||||
|
text: amet.
|
||||||
- children: []
|
- children: []
|
||||||
content_layer: body
|
content_layer: body
|
||||||
label: text
|
label: text
|
||||||
orig: Some
|
orig: Some
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/groups/8'
|
$ref: '#/groups/11'
|
||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/33'
|
self_ref: '#/texts/41'
|
||||||
text: Some
|
text: Some
|
||||||
- captions: []
|
- captions: []
|
||||||
children: []
|
children: []
|
||||||
@@ -535,13 +805,13 @@ texts:
|
|||||||
label: code
|
label: code
|
||||||
orig: formatted_code
|
orig: formatted_code
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/groups/8'
|
$ref: '#/groups/11'
|
||||||
prov: []
|
prov: []
|
||||||
references: []
|
references: []
|
||||||
self_ref: '#/texts/34'
|
self_ref: '#/texts/42'
|
||||||
text: formatted_code
|
text: formatted_code
|
||||||
- children:
|
- children:
|
||||||
- $ref: '#/groups/9'
|
- $ref: '#/groups/12'
|
||||||
content_layer: body
|
content_layer: body
|
||||||
label: section_header
|
label: section_header
|
||||||
level: 1
|
level: 1
|
||||||
@@ -549,7 +819,7 @@ texts:
|
|||||||
parent:
|
parent:
|
||||||
$ref: '#/body'
|
$ref: '#/body'
|
||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/35'
|
self_ref: '#/texts/43'
|
||||||
text: ''
|
text: ''
|
||||||
- children: []
|
- children: []
|
||||||
content_layer: body
|
content_layer: body
|
||||||
@@ -562,18 +832,18 @@ texts:
|
|||||||
label: text
|
label: text
|
||||||
orig: Partially formatted
|
orig: Partially formatted
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/groups/9'
|
$ref: '#/groups/12'
|
||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/36'
|
self_ref: '#/texts/44'
|
||||||
text: Partially formatted
|
text: Partially formatted
|
||||||
- children: []
|
- children: []
|
||||||
content_layer: body
|
content_layer: body
|
||||||
label: text
|
label: text
|
||||||
orig: heading to_escape
|
orig: heading to_escape
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/groups/9'
|
$ref: '#/groups/12'
|
||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/37'
|
self_ref: '#/texts/45'
|
||||||
text: heading to_escape
|
text: heading to_escape
|
||||||
- captions: []
|
- captions: []
|
||||||
children: []
|
children: []
|
||||||
@@ -583,10 +853,10 @@ texts:
|
|||||||
label: code
|
label: code
|
||||||
orig: not_to_escape
|
orig: not_to_escape
|
||||||
parent:
|
parent:
|
||||||
$ref: '#/groups/9'
|
$ref: '#/groups/12'
|
||||||
prov: []
|
prov: []
|
||||||
references: []
|
references: []
|
||||||
self_ref: '#/texts/38'
|
self_ref: '#/texts/46'
|
||||||
text: not_to_escape
|
text: not_to_escape
|
||||||
- children: []
|
- children: []
|
||||||
content_layer: body
|
content_layer: body
|
||||||
@@ -596,6 +866,16 @@ texts:
|
|||||||
parent:
|
parent:
|
||||||
$ref: '#/body'
|
$ref: '#/body'
|
||||||
prov: []
|
prov: []
|
||||||
self_ref: '#/texts/39'
|
self_ref: '#/texts/47'
|
||||||
text: $$E=mc^2$$
|
text: $$E=mc^2$$
|
||||||
|
- children: []
|
||||||
|
content_layer: body
|
||||||
|
label: section_header
|
||||||
|
level: 1
|
||||||
|
orig: Table Heading
|
||||||
|
parent:
|
||||||
|
$ref: '#/body'
|
||||||
|
prov: []
|
||||||
|
self_ref: '#/texts/48'
|
||||||
|
text: Table Heading
|
||||||
version: 1.4.0
|
version: 1.4.0
|
||||||
|
9
tests/data/md/inline_and_formatting.md
vendored
@@ -16,8 +16,17 @@ Create your feature branch: `git checkout -b feature/AmazingFeature`.
 
 # *Whole heading is italic*
 
+- **First**: Lorem ipsum.
+- **Second**: Dolor `sit` amet.
+
 Some *`formatted_code`*
 
 ## *Partially formatted* heading to_escape `not_to_escape`
 
 [$$E=mc^2$$](https://en.wikipedia.org/wiki/Albert_Einstein)
+
+## Table Heading
+
+| **Bold Heading** | *Italic Heading* |
+|------------------|------------------|
+| data a | data b |