feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests, after merging with Main, Switched to Accurate TF model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2025-03-11 10:53:49 +01:00
commit eb97357b05
parent 5e30381c0d
43 changed files with 213 additions and 229 deletions
--- a/docling/cli/main.py
+++ b/docling/cli/main.py
@@ -210,7 +210,7 @@ def convert(
    table_mode: Annotated[
        TableFormerMode,
        typer.Option(..., help="The mode to use in the table structure model."),
-    ] = TableFormerMode.FAST,
+    ] = TableFormerMode.ACCURATE,
    enrich_code: Annotated[
        bool,
        typer.Option(..., help="Enable the code enrichment model in the pipeline."),
--- a/docling/datamodel/pipeline_options.py
+++ b/docling/datamodel/pipeline_options.py
@@ -99,7 +99,7 @@ class TableStructureOptions(BaseModel):
        #        are merged across table columns.
        # False: Let table structure model define the text cells, ignore PDF cells.
    )
-    mode: TableFormerMode = TableFormerMode.FAST
+    mode: TableFormerMode = TableFormerMode.ACCURATE


 class OcrOptions(BaseModel):
--- a/docling/models/table_structure_model.py
+++ b/docling/models/table_structure_model.py
@@ -95,7 +95,7 @@ class TableStructureModel(BasePageModel):
            repo_id="ds4sd/docling-models",
            force_download=force,
            local_dir=local_dir,
-            revision="v2.1.0",
+            revision="v2.2.0",
        )

        return Path(download_path)
--- a/docs/usage/index.md
+++ b/docs/usage/index.md
@@ -135,7 +135,7 @@ doc_converter = DocumentConverter(
 )
 ```

-Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (default) and `TableFormerMode.ACCURATE` (better, but slower) to receive better quality with difficult table structures.
+Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between `TableFormerMode.FAST` (faster but less accurate) and `TableFormerMode.ACCURATE` (default) to receive better quality with difficult table structures.

 ```python
 from docling.datamodel.base_models import InputFormat
--- a/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt
+++ b/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt
@@ -12,7 +12,7 @@
 </figure>
 <table>
 <location><page_1><loc_52><loc_62><loc_88><loc_71></location>
-<row_0><col_0><col_header>3</col_0><col_1><col_header>1</col_1></row_0>
+<row_0><col_0><col_header>1</col_0></row_0>
 </table>
 <paragraph><location><page_1><loc_52><loc_58><loc_79><loc_60></location>- b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</paragraph>
 <paragraph><location><page_1><loc_52><loc_46><loc_80><loc_47></location>- c. Structure predicted by TableFormer:</paragraph>
@@ -25,11 +25,11 @@
 </figure>
 <table>
 <location><page_1><loc_52><loc_37><loc_88><loc_45></location>
-<row_0><col_0><col_header>0</col_0><col_1><col_header>1</col_1><col_2><col_header>1</col_2><col_3><col_header>2 1</col_3><col_4><col_header>2 1</col_4><col_5><body></col_5></row_0>
-<row_1><col_0><body>3</col_0><col_1><body>4</col_1><col_2><body>5 3</col_2><col_3><body>6</col_3><col_4><body>7</col_4><col_5><body></col_5></row_1>
-<row_2><col_0><body>8</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4><col_5><body>2</col_5></row_2>
-<row_3><col_0><body></col_0><col_1><body>13</col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4><col_5><body>2</col_5></row_3>
-<row_4><col_0><body></col_0><col_1><body>17</col_1><col_2><body>18</col_2><col_3><body>19</col_3><col_4><body>20</col_4><col_5><body>2</col_5></row_4>
+<row_0><col_0><body>0</col_0><col_1><body>1 2 1</col_1><col_2><body>1 2 1</col_2><col_3><body>1 2 1</col_3><col_4><body>1 2 1</col_4></row_0>
+<row_1><col_0><body>3</col_0><col_1><body>4 3</col_1><col_2><body>5</col_2><col_3><body>6</col_3><col_4><body>7</col_4></row_1>
+<row_2><col_0><body>8 2</col_0><col_1><body>9</col_1><col_2><body>10</col_2><col_3><body>11</col_3><col_4><body>12</col_4></row_2>
+<row_3><col_0><body>13</col_0><col_1><body></col_1><col_2><body>14</col_2><col_3><body>15</col_3><col_4><body>16</col_4></row_3>
+<row_4><col_0><body>17</col_0><col_1><body>18</col_1><col_2><body></col_2><col_3><body>19</col_3><col_4><body>20</col_4></row_4>
 </table>
 <paragraph><location><page_1><loc_50><loc_16><loc_89><loc_26></location>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</paragraph>
 <paragraph><location><page_1><loc_50><loc_10><loc_89><loc_16></location>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</paragraph>
@@ -138,9 +138,9 @@
 <location><page_7><loc_50><loc_62><loc_87><loc_69></location>
 <caption>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
 <row_0><col_0><col_header>Model</col_0><col_1><col_header>Dataset</col_1><col_2><col_header>mAP</col_2><col_3><col_header>mAP (PP)</col_3></row_0>
-<row_1><col_0><body>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
-<row_2><col_0><body>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
-<row_3><col_0><body>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
+<row_1><col_0><row_header>EDD+BBox</col_0><col_1><body>PubTabNet</col_1><col_2><body>79.2</col_2><col_3><body>82.7</col_3></row_1>
+<row_2><col_0><row_header>TableFormer</col_0><col_1><body>PubTabNet</col_1><col_2><body>82.1</col_2><col_3><body>86.8</col_3></row_2>
+<row_3><col_0><row_header>TableFormer</col_0><col_1><body>SynthTabNet</col_1><col_2><body>87.7</col_2><col_3><body>-</col_3></row_3>
 </table>
 <caption><location><page_7><loc_50><loc_57><loc_89><loc_60></location>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
 <paragraph><location><page_7><loc_50><loc_34><loc_89><loc_54></location>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</paragraph>
@@ -179,7 +179,7 @@
 <row_6><col_0><row_header>第 17 回人工知能学会全国大会 (2003)</col_0><col_1><body>208</col_1><col_2><body>5</col_2><col_3><body>203</col_3><col_4><body>152</col_4><col_5><body>244</col_5></row_6>
 <row_7><col_0><row_header>自然言語処理研究会第 146 〜 155 回</col_0><col_1><body>98</col_1><col_2><body>2</col_2><col_3><body>96</col_3><col_4><body>150</col_4><col_5><body>232</col_5></row_7>
 <row_8><col_0><row_header>WWW から収集した論文</col_0><col_1><body>107</col_1><col_2><body>73</col_2><col_3><body>34</col_3><col_4><body>147</col_4><col_5><body>96</col_5></row_8>
-<row_9><col_0><body></col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
+<row_9><col_0><row_header>計</col_0><col_1><body>945</col_1><col_2><body>294</col_2><col_3><body>651</col_3><col_4><body>1122</col_4><col_5><body>955</col_5></row_9>
 </table>
 <caption><location><page_8><loc_62><loc_62><loc_90><loc_63></location>Text is aligned to match original for ease of viewing</caption>
 <table>
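The doctags ground-truth changes above can be hard to review by eye. A minimal sketch of a helper that summarises how a single table row differs between two versions is shown below. It is not part of this commit or of docling: `parse_row` and `diff_row` are hypothetical names, and the sample rows are shortened copies of the Table 3 rows from the hunk above.

```python
import re

# One doctags cell: <col_N><kind>text</col_N>, where kind is one of the three
# cell tags that appear in these ground-truth files.
CELL_RE = re.compile(r"<col_(\d+)><(col_header|row_header|body)>(.*?)</col_\1>")


def parse_row(row: str) -> list[tuple[str, str]]:
    """Return (cell_kind, text) pairs for one doctags table row."""
    return [(kind, text) for _, kind, text in CELL_RE.findall(row)]


def diff_row(old: str, new: str) -> list[str]:
    """Describe how a row differs between two ground-truth versions."""
    old_cells, new_cells = parse_row(old), parse_row(new)
    notes = []
    if len(old_cells) != len(new_cells):
        notes.append(f"column count {len(old_cells)} -> {len(new_cells)}")
    # Compare cells position by position; extra trailing cells are covered
    # by the column-count note above.
    for i, (o, n) in enumerate(zip(old_cells, new_cells)):
        if o != n:
            notes.append(f"col {i}: {o} -> {n}")
    return notes


# Shortened (two-column) copies of the old and new row_1 of Table 3.
old = "<row_1><col_0><body>EDD+BBox</col_0><col_1><body>PubTabNet</col_1></row_1>"
new = "<row_1><col_0><row_header>EDD+BBox</col_0><col_1><body>PubTabNet</col_1></row_1>"

for note in diff_row(old, new):
    print(note)
```

For these two rows the only reported difference is the first cell changing from `body` to `row_header`, which matches the row-header changes in the Table 3 hunk.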
--- a/tests/data/groundtruth/docling_v1/2203.01017v2.json
+++ b/tests/data/groundtruth/docling_v1/2203.01017v2.json
--- a/tests/data/groundtruth/docling_v1/2203.01017v2.md
+++ b/tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -25,12 +25,12 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti
 Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
 <!-- image -->

-| 0   |   1 | 1   |   2 1 |   2 1 |    |
-|-----|-----|-----|-------|-------|----|
-| 3   |   4 | 5 3 |     6 |     7 |    |
-| 8   |   9 | 10  |    11 |    12 | 2  |
-|     |  13 | 14  |    15 |    16 | 2  |
-|     |  17 | 18  |    19 |    20 | 2  |
+| 0   | 1 2 1   | 1 2 1   |   1 2 1 |   1 2 1 |
+|-----|---------|---------|---------|---------|
+| 3   | 4 3     | 5       |       6 |       7 |
+| 8 2 | 9       | 10      |      11 |      12 |
+| 13  |         | 14      |      15 |      16 |
+| 17  | 18      |         |      19 |      20 |

 Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.

@@ -241,7 +241,7 @@ Text is aligned to match original for ease of viewing
 | 第 17 回人工知能学会全国大会 (2003)                | 208         | 5              | 203            | 152        | 244        |
 | 自然言語処理研究会第 146 〜 155 回                 | 98          | 2              | 96             | 150        | 232        |
 | WWW から収集した論文                               | 107         | 73             | 34             | 147        | 96         |
-|                                                    | 945         | 294            | 651            | 1122       | 955        |
+| 計                                                 | 945         | 294            | 651            | 1122       | 955        |

 |                          | Shares (in millions)   | Shares (in millions)   | Weighted Average Grant Date Fair Value   | Weighted Average Grant Date Fair Value   |
 |--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
--- a/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json
+++ b/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json
--- a/tests/data/groundtruth/docling_v1/2206.01062.doctags.txt
+++ b/tests/data/groundtruth/docling_v1/2206.01062.doctags.txt
@@ -56,7 +56,7 @@
 <table>
 <location><page_4><loc_16><loc_63><loc_84><loc_83></location>
 <caption>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption>
-<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>% of Total</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
+<row_0><col_0><body></col_0><col_1><body></col_1><col_2><col_header>% of Total</col_2><col_3><col_header>% of Total</col_3><col_4><col_header>% of Total</col_4><col_5><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_5><col_6><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_6><col_7><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_7><col_8><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_8><col_9><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_9><col_10><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_10><col_11><col_header>triple inter-annotator mAP @ 0.5-0.95 (%)</col_11></row_0>
 <row_1><col_0><col_header>class label</col_0><col_1><col_header>Count</col_1><col_2><col_header>Train</col_2><col_3><col_header>Test</col_3><col_4><col_header>Val</col_4><col_5><col_header>All</col_5><col_6><col_header>Fin</col_6><col_7><col_header>Man</col_7><col_8><col_header>Sci</col_8><col_9><col_header>Law</col_9><col_10><col_header>Pat</col_10><col_11><col_header>Ten</col_11></row_1>
 <row_2><col_0><row_header>Caption</col_0><col_1><body>22524</col_1><col_2><body>2.04</col_2><col_3><body>1.77</col_3><col_4><body>2.32</col_4><col_5><body>84-89</col_5><col_6><body>40-61</col_6><col_7><body>86-92</col_7><col_8><body>94-99</col_8><col_9><body>95-99</col_9><col_10><body>69-78</col_10><col_11><body>n/a</col_11></row_2>
 <row_3><col_0><row_header>Footnote</col_0><col_1><body>6318</col_1><col_2><body>0.60</col_2><col_3><body>0.31</col_3><col_4><body>0.58</col_4><col_5><body>83-91</col_5><col_6><body>n/a</col_6><col_7><body>100</col_7><col_8><body>62-88</col_8><col_9><body>85-94</col_9><col_10><body>n/a</col_10><col_11><body>82-97</col_11></row_3>
@@ -102,7 +102,7 @@
 <table>
 <location><page_6><loc_10><loc_56><loc_47><loc_75></location>
 <row_0><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>MRCNN</col_2><col_3><col_header>MRCNN</col_3><col_4><col_header>FRCNN</col_4><col_5><col_header>YOLO</col_5></row_0>
-<row_1><col_0><body></col_0><col_1><col_header>human</col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
+<row_1><col_0><body></col_0><col_1><body></col_1><col_2><col_header>R50</col_2><col_3><col_header>R101</col_3><col_4><col_header>R101</col_4><col_5><col_header>v5x6</col_5></row_1>
 <row_2><col_0><row_header>Caption</col_0><col_1><body>84-89</col_1><col_2><body>68.4</col_2><col_3><body>71.5</col_3><col_4><body>70.1</col_4><col_5><body>77.7</col_5></row_2>
 <row_3><col_0><row_header>Footnote</col_0><col_1><body>83-91</col_1><col_2><body>70.9</col_2><col_3><body>71.8</col_3><col_4><body>73.7</col_4><col_5><body>77.2</col_5></row_3>
 <row_4><col_0><row_header>Formula</col_0><col_1><body>83-85</col_1><col_2><body>60.1</col_2><col_3><body>63.4</col_3><col_4><body>63.5</col_4><col_5><body>66.2</col_5></row_4>
@@ -130,7 +130,7 @@
 <paragraph><location><page_7><loc_9><loc_84><loc_48><loc_89></location>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</paragraph>
 <table>
 <location><page_7><loc_13><loc_63><loc_44><loc_81></location>
-<row_0><col_0><col_header>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
+<row_0><col_0><body>Class-count</col_0><col_1><col_header>11</col_1><col_2><col_header>6</col_2><col_3><col_header>5</col_3><col_4><col_header>4</col_4></row_0>
 <row_1><col_0><row_header>Caption</col_0><col_1><body>68</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_1>
 <row_2><col_0><row_header>Footnote</col_0><col_1><body>71</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_2>
 <row_3><col_0><row_header>Formula</col_0><col_1><body>60</col_1><col_2><body>Text</col_2><col_3><body>Text</col_3><col_4><body>Text</col_4></row_3>
@@ -178,17 +178,17 @@
 <row_1><col_0><col_header>Training on</col_0><col_1><col_header>labels</col_1><col_2><col_header>PLN</col_2><col_3><col_header>DB</col_3><col_4><col_header>DLN</col_4></row_1>
 <row_2><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>96</col_2><col_3><body>43</col_3><col_4><body>23</col_4></row_2>
 <row_3><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>87</col_2><col_3><body>-</col_3><col_4><body>32</col_4></row_3>
-<row_4><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
-<row_5><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
-<row_6><col_0><row_header>PubLayNet (PLN)</col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
+<row_4><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>95</col_2><col_3><body>24</col_3><col_4><body>49</col_4></row_4>
+<row_5><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>96</col_2><col_3><body>-</col_3><col_4><body>42</col_4></row_5>
+<row_6><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>93</col_2><col_3><body>34</col_3><col_4><body>30</col_4></row_6>
 <row_7><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Figure</col_1><col_2><body>77</col_2><col_3><body>71</col_3><col_4><body>31</col_4></row_7>
 <row_8><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>Table</col_1><col_2><body>19</col_2><col_3><body>65</col_3><col_4><body>22</col_4></row_8>
 <row_9><col_0><row_header>DocBank (DB)</col_0><col_1><row_header>total</col_1><col_2><body>48</col_2><col_3><body>68</col_3><col_4><body>27</col_4></row_9>
 <row_10><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Figure</col_1><col_2><body>67</col_2><col_3><body>51</col_3><col_4><body>72</col_4></row_10>
 <row_11><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Sec-header</col_1><col_2><body>53</col_2><col_3><body>-</col_3><col_4><body>68</col_4></row_11>
-<row_12><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
-<row_13><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
-<row_14><col_0><row_header>DocLayNet (DLN)</col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
+<row_12><col_0><body></col_0><col_1><row_header>Table</col_1><col_2><body>87</col_2><col_3><body>43</col_3><col_4><body>82</col_4></row_12>
+<row_13><col_0><body></col_0><col_1><row_header>Text</col_1><col_2><body>77</col_2><col_3><body>-</col_3><col_4><body>84</col_4></row_13>
+<row_14><col_0><body></col_0><col_1><row_header>total</col_1><col_2><body>59</col_2><col_3><body>47</col_3><col_4><body>78</col_4></row_14>
 </table>
 <paragraph><location><page_8><loc_9><loc_44><loc_48><loc_51></location>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</paragraph>
 <paragraph><location><page_8><loc_9><loc_26><loc_48><loc_44></location>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</paragraph>
--- a/tests/data/groundtruth/docling_v1/2206.01062.json
+++ b/tests/data/groundtruth/docling_v1/2206.01062.json
--- a/tests/data/groundtruth/docling_v1/2206.01062.md
+++ b/tests/data/groundtruth/docling_v1/2206.01062.md
@@ -98,21 +98,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif

 Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.

-|                |         | % of Total   | % of Total   | % of Total   | % of Total   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   |
-|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
-| class label    | Count   | Train        | Test         | Val          | All          | Fin                                         | Man                                         | Sci                                         | Law                                         | Pat                                         | Ten                                         |
-| Caption        | 22524   | 2.04         | 1.77         | 2.32         | 84-89        | 40-61                                       | 86-92                                       | 94-99                                       | 95-99                                       | 69-78                                       | n/a                                         |
-| Footnote       | 6318    | 0.60         | 0.31         | 0.58         | 83-91        | n/a                                         | 100                                         | 62-88                                       | 85-94                                       | n/a                                         | 82-97                                       |
-| Formula        | 25027   | 2.25         | 1.90         | 2.96         | 83-85        | n/a                                         | n/a                                         | 84-87                                       | 86-96                                       | n/a                                         | n/a                                         |
-| List-item      | 185660  | 17.19        | 13.34        | 15.82        | 87-88        | 74-83                                       | 90-92                                       | 97-97                                       | 81-85                                       | 75-88                                       | 93-95                                       |
-| Page-footer    | 70878   | 6.51         | 5.58         | 6.00         | 93-94        | 88-90                                       | 95-96                                       | 100                                         | 92-97                                       | 100                                         | 96-98                                       |
-| Page-header    | 58022   | 5.10         | 6.70         | 5.06         | 85-89        | 66-76                                       | 90-94                                       | 98-100                                      | 91-92                                       | 97-99                                       | 81-86                                       |
-| Picture        | 45976   | 4.21         | 2.78         | 5.31         | 69-71        | 56-59                                       | 82-86                                       | 69-82                                       | 80-95                                       | 66-71                                       | 59-76                                       |
-| Section-header | 142884  | 12.60        | 15.77        | 12.85        | 83-84        | 76-81                                       | 90-92                                       | 94-95                                       | 87-94                                       | 69-73                                       | 78-86                                       |
-| Table          | 34733   | 3.20         | 2.27         | 3.60         | 77-81        | 75-80                                       | 83-86                                       | 98-99                                       | 58-80                                       | 79-84                                       | 70-85                                       |
-| Text           | 510377  | 45.82        | 49.28        | 45.00        | 84-86        | 81-86                                       | 88-93                                       | 89-93                                       | 87-92                                       | 71-79                                       | 87-95                                       |
-| Title          | 5071    | 0.47         | 0.30         | 0.50         | 60-72        | 24-63                                       | 50-63                                       | 94-100                                      | 82-96                                       | 68-79                                       | 24-56                                       |
-| Total          | 1107470 | 941123       | 99816        | 66531        | 82-83        | 71-74                                       | 79-81                                       | 89-94                                       | 86-91                                       | 71-76                                       | 68-85                                       |
+|                |         | % of Total   | % of Total   | % of Total   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   |
+|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
+| class label    | Count   | Train        | Test         | Val          | All                                         | Fin                                         | Man                                         | Sci                                         | Law                                         | Pat                                         | Ten                                         |
+| Caption        | 22524   | 2.04         | 1.77         | 2.32         | 84-89                                       | 40-61                                       | 86-92                                       | 94-99                                       | 95-99                                       | 69-78                                       | n/a                                         |
+| Footnote       | 6318    | 0.60         | 0.31         | 0.58         | 83-91                                       | n/a                                         | 100                                         | 62-88                                       | 85-94                                       | n/a                                         | 82-97                                       |
+| Formula        | 25027   | 2.25         | 1.90         | 2.96         | 83-85                                       | n/a                                         | n/a                                         | 84-87                                       | 86-96                                       | n/a                                         | n/a                                         |
+| List-item      | 185660  | 17.19        | 13.34        | 15.82        | 87-88                                       | 74-83                                       | 90-92                                       | 97-97                                       | 81-85                                       | 75-88                                       | 93-95                                       |
+| Page-footer    | 70878   | 6.51         | 5.58         | 6.00         | 93-94                                       | 88-90                                       | 95-96                                       | 100                                         | 92-97                                       | 100                                         | 96-98                                       |
+| Page-header    | 58022   | 5.10         | 6.70         | 5.06         | 85-89                                       | 66-76                                       | 90-94                                       | 98-100                                      | 91-92                                       | 97-99                                       | 81-86                                       |
+| Picture        | 45976   | 4.21         | 2.78         | 5.31         | 69-71                                       | 56-59                                       | 82-86                                       | 69-82                                       | 80-95                                       | 66-71                                       | 59-76                                       |
+| Section-header | 142884  | 12.60        | 15.77        | 12.85        | 83-84                                       | 76-81                                       | 90-92                                       | 94-95                                       | 87-94                                       | 69-73                                       | 78-86                                       |
+| Table          | 34733   | 3.20         | 2.27         | 3.60         | 77-81                                       | 75-80                                       | 83-86                                       | 98-99                                       | 58-80                                       | 79-84                                       | 70-85                                       |
+| Text           | 510377  | 45.82        | 49.28        | 45.00        | 84-86                                       | 81-86                                       | 88-93                                       | 89-93                                       | 87-92                                       | 71-79                                       | 87-95                                       |
+| Title          | 5071    | 0.47         | 0.30         | 0.50         | 60-72                                       | 24-63                                       | 50-63                                       | 94-100                                      | 82-96                                       | 68-79                                       | 24-56                                       |
+| Total          | 1107470 | 941123       | 99816        | 66531        | 82-83                                       | 71-74                                       | 79-81                                       | 89-94                                       | 86-91                                       | 71-76                                       | 68-85                                       |

 Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.
 <!-- image -->
@ -161,7 +161,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D

 |                | human   | MRCNN   | MRCNN   | FRCNN   | YOLO   |
 |----------------|---------|---------|---------|---------|--------|
-|                | human   | R50     | R101    | R101    | v5x6   |
+|                |         | R50     | R101    | R101    | v5x6   |
 | Caption        | 84-89   | 68.4    | 71.5    | 70.1    | 77.7   |
 | Footnote       | 83-91   | 70.9    | 71.8    | 73.7    | 77.2   |
 | Formula        | 83-85   | 60.1    | 63.4    | 63.5    | 66.2   |
@ -252,17 +252,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
 | Training on     | labels     | PLN          | DB           | DLN          |
 | PubLayNet (PLN) | Figure     | 96           | 43           | 23           |
 | PubLayNet (PLN) | Sec-header | 87           | -            | 32           |
-| PubLayNet (PLN) | Table      | 95           | 24           | 49           |
-| PubLayNet (PLN) | Text       | 96           | -            | 42           |
-| PubLayNet (PLN) | total      | 93           | 34           | 30           |
+|                 | Table      | 95           | 24           | 49           |
+|                 | Text       | 96           | -            | 42           |
+|                 | total      | 93           | 34           | 30           |
 | DocBank (DB)    | Figure     | 77           | 71           | 31           |
 | DocBank (DB)    | Table      | 19           | 65           | 22           |
 | DocBank (DB)    | total      | 48           | 68           | 27           |
 | DocLayNet (DLN) | Figure     | 67           | 51           | 72           |
 | DocLayNet (DLN) | Sec-header | 53           | -            | 68           |
-| DocLayNet (DLN) | Table      | 87           | 43           | 82           |
-| DocLayNet (DLN) | Text       | 77           | -            | 84           |
-| DocLayNet (DLN) | total      | 59           | 47           | 78           |
+|                 | Table      | 87           | 43           | 82           |
+|                 | Text       | 77           | -            | 84           |
+|                 | total      | 59           | 47           | 78           |

 Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .

--- a/tests/data/groundtruth/docling_v1/2206.01062.pages.json
+++ b/tests/data/groundtruth/docling_v1/2206.01062.pages.json
--- a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.doctags.txt
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.doctags.txt
@ -5,13 +5,12 @@
 <table>
 <location><page_1><loc_23><loc_41><loc_78><loc_57></location>
 <caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
-<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
-<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
+<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
+<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
 <row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
-<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
-<row_4><col_0><body></col_0><col_1><body></col_1><col_2><body>OTSL</col_2><col_3><body>0.952 0.923</col_3><col_4><body>0.909</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
-<row_5><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>HTML</col_2><col_3><body>0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
-<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
+<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
+<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
+<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
 </table>
 <caption><location><page_1><loc_22><loc_59><loc_79><loc_66></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
 <subtitle-level-1><location><page_1><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>
--- a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json
--- a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.md
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.md
@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly

 Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.

-| #          | #          | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
-|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
-| enc-layers | dec-layers | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
-| 6          | 6          | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
-| 4          | 4          | OTSL HTML  | 0.938       | 0.904       | 0.927       | 0.853       | 1.97        |
-|            |            | OTSL       | 0.952 0.923 | 0.909       | 0.938       | 0.843       | 3.77        |
-| 2          | 4          | HTML       | 0.945       | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
-| 4          | 2          | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |
+| # enc-layers   | # dec-layers   | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
+|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
+| # enc-layers   | # dec-layers   | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
+| 6              | 6              | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
+| 4              | 4              | OTSL HTML  | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77   |
+| 2              | 4              | OTSL HTML  | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
+| 4              | 2              | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |

 ## 5.2 Quantitative Results

--- a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json
--- a/tests/data/groundtruth/docling_v1/2305.03393v1.doctags.txt
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1.doctags.txt
@ -77,13 +77,12 @@
 <table>
 <location><page_9><loc_23><loc_41><loc_78><loc_57></location>
 <caption>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
-<row_0><col_0><col_header>#</col_0><col_1><col_header>#</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
-<row_1><col_0><col_header>enc-layers</col_0><col_1><col_header>dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
+<row_0><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>TEDs</col_5><col_6><col_header>mAP</col_6><col_7><col_header>Inference</col_7></row_0>
+<row_1><col_0><col_header># enc-layers</col_0><col_1><col_header># dec-layers</col_1><col_2><col_header>Language</col_2><col_3><col_header>simple</col_3><col_4><col_header>complex</col_4><col_5><col_header>all</col_5><col_6><col_header>(0.75)</col_6><col_7><col_header>time (secs)</col_7></row_1>
 <row_2><col_0><body>6</col_0><col_1><body>6</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.965 0.969</col_3><col_4><body>0.934 0.927</col_4><col_5><body>0.955 0.955</col_5><col_6><body>0.88 0.857</col_6><col_7><body>2.73 5.39</col_7></row_2>
-<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904</col_4><col_5><body>0.927</col_5><col_6><body>0.853</col_6><col_7><body>1.97</col_7></row_3>
-<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.909 0.897</col_4><col_5><body>0.938</col_5><col_6><body>0.843</col_6><col_7><body>3.77</col_7></row_4>
-<row_5><col_0><body></col_0><col_1><body></col_1><col_2><body>HTML</col_2><col_3><body></col_3><col_4><body>0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_5>
-<row_6><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_6>
+<row_3><col_0><body>4</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.938 0.952</col_3><col_4><body>0.904 0.909</col_4><col_5><body>0.927 0.938</col_5><col_6><body>0.853 0.843</col_6><col_7><body>1.97 3.77</col_7></row_3>
+<row_4><col_0><body>2</col_0><col_1><body>4</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.923 0.945</col_3><col_4><body>0.897 0.901</col_4><col_5><body>0.915 0.931</col_5><col_6><body>0.859 0.834</col_6><col_7><body>1.91 3.81</col_7></row_4>
+<row_5><col_0><body>4</col_0><col_1><body>2</col_1><col_2><body>OTSL HTML</col_2><col_3><body>0.952 0.944</col_3><col_4><body>0.92 0.903</col_4><col_5><body>0.942 0.931</col_5><col_6><body>0.857 0.824</col_6><col_7><body>1.22 2</col_7></row_5>
 </table>
 <caption><location><page_9><loc_22><loc_59><loc_79><loc_65></location>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption>
 <subtitle-level-1><location><page_9><loc_22><loc_35><loc_43><loc_36></location>5.2 Quantitative Results</subtitle-level-1>
@ -92,14 +91,11 @@
 <table>
 <location><page_10><loc_23><loc_67><loc_77><loc_80></location>
 <caption>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
-<row_0><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
-<row_1><col_0><body></col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
-<row_2><col_0><row_header>PubTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.965</col_2><col_3><body>0.934</col_3><col_4><body>0.955</col_4><col_5><body>0.88</col_5><col_6><body>2.73</col_6></row_2>
-<row_3><col_0><row_header>PubTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.969</col_2><col_3><body>0.927</col_3><col_4><body>0.955</col_4><col_5><body>0.857</col_5><col_6><body>5.39</col_6></row_3>
-<row_4><col_0><row_header>FinTabNet</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.955</col_2><col_3><body>0.961</col_3><col_4><body>0.959</col_4><col_5><body>0.862</col_5><col_6><body>1.85</col_6></row_4>
-<row_5><col_0><row_header>FinTabNet</col_0><col_1><row_header>HTML</col_1><col_2><body>0.917</col_2><col_3><body>0.922</col_3><col_4><body>0.92</col_4><col_5><body>0.722</col_5><col_6><body>3.26</col_6></row_5>
-<row_6><col_0><row_header>PubTables-1M</col_0><col_1><row_header>OTSL</col_1><col_2><body>0.987</col_2><col_3><body>0.964</col_3><col_4><body>0.977</col_4><col_5><body>0.896</col_5><col_6><body>1.79</col_6></row_6>
-<row_7><col_0><row_header>PubTables-1M</col_0><col_1><row_header>HTML</col_1><col_2><body>0.983</col_2><col_3><body>0.944</col_3><col_4><body>0.966</col_4><col_5><body>0.889</col_5><col_6><body>3.26</col_6></row_7>
+<row_0><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>TEDs</col_2><col_3><col_header>TEDs</col_3><col_4><col_header>TEDs</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_0>
+<row_1><col_0><col_header>Data set</col_0><col_1><col_header>Language</col_1><col_2><col_header>simple</col_2><col_3><col_header>complex</col_3><col_4><col_header>all</col_4><col_5><col_header>mAP(0.75)</col_5><col_6><col_header>Inference time (secs)</col_6></row_1>
+<row_2><col_0><body>PubTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.965 0.969</col_2><col_3><body>0.934 0.927</col_3><col_4><body>0.955 0.955</col_4><col_5><body>0.88 0.857</col_5><col_6><body>2.73 5.39</col_6></row_2>
+<row_3><col_0><body>FinTabNet</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.955 0.917</col_2><col_3><body>0.961 0.922</col_3><col_4><body>0.959 0.92</col_4><col_5><body>0.862 0.722</col_5><col_6><body>1.85 3.26</col_6></row_3>
+<row_4><col_0><body>PubTables-1M</col_0><col_1><body>OTSL HTML</col_1><col_2><body>0.987 0.983</col_2><col_3><body>0.964 0.944</col_3><col_4><body>0.977 0.966</col_4><col_5><body>0.896 0.889</col_5><col_6><body>1.79 3.26</col_6></row_4>
 </table>
 <caption><location><page_10><loc_22><loc_82><loc_79><loc_85></location>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption>
 <subtitle-level-1><location><page_10><loc_22><loc_62><loc_42><loc_64></location>5.3 Qualitative Results</subtitle-level-1>
--- a/tests/data/groundtruth/docling_v1/2305.03393v1.json
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1.json
--- a/tests/data/groundtruth/docling_v1/2305.03393v1.md
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1.md
@ -130,14 +130,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly

 Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.

-| #          | #          | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
-|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
-| enc-layers | dec-layers | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
-| 6          | 6          | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
-| 4          | 4          | OTSL HTML  | 0.938 0.952 | 0.904       | 0.927       | 0.853       | 1.97        |
-| 2          | 4          | OTSL       | 0.923 0.945 | 0.909 0.897 | 0.938       | 0.843       | 3.77        |
-|            |            | HTML       |             | 0.901       | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
-| 4          | 2          | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |
+| # enc-layers   | # dec-layers   | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
+|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
+| # enc-layers   | # dec-layers   | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
+| 6              | 6              | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
+| 4              | 4              | OTSL HTML  | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77   |
+| 2              | 4              | OTSL HTML  | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
+| 4              | 2              | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |

 ## 5.2 Quantitative Results

@ -147,15 +146,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied

 Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).

-|              | Language   | TEDs   | TEDs    | TEDs   | mAP(0.75)   | Inference time (secs)   |
-|--------------|------------|--------|---------|--------|-------------|-------------------------|
-|              | Language   | simple | complex | all    | mAP(0.75)   | Inference time (secs)   |
-| PubTabNet    | OTSL       | 0.965  | 0.934   | 0.955  | 0.88        | 2.73                    |
-| PubTabNet    | HTML       | 0.969  | 0.927   | 0.955  | 0.857       | 5.39                    |
-| FinTabNet    | OTSL       | 0.955  | 0.961   | 0.959  | 0.862       | 1.85                    |
-| FinTabNet    | HTML       | 0.917  | 0.922   | 0.92   | 0.722       | 3.26                    |
-| PubTables-1M | OTSL       | 0.987  | 0.964   | 0.977  | 0.896       | 1.79                    |
-| PubTables-1M | HTML       | 0.983  | 0.944   | 0.966  | 0.889       | 3.26                    |
+| Data set     | Language   | TEDs        | TEDs        | TEDs        | mAP(0.75)   | Inference time (secs)   |
+|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
+| Data set     | Language   | simple      | complex     | all         | mAP(0.75)   | Inference time (secs)   |
+| PubTabNet    | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39               |
+| FinTabNet    | OTSL HTML  | 0.955 0.917 | 0.961 0.922 | 0.959 0.92  | 0.862 0.722 | 1.85 3.26               |
+| PubTables-1M | OTSL HTML  | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26               |

 ## 5.3 Qualitative Results

--- a/tests/data/groundtruth/docling_v1/2305.03393v1.pages.json
+++ b/tests/data/groundtruth/docling_v1/2305.03393v1.pages.json
--- a/tests/data/groundtruth/docling_v1/redp5110_sampled.doctags.txt
+++ b/tests/data/groundtruth/docling_v1/redp5110_sampled.doctags.txt
@ -130,7 +130,7 @@
 <table>
 <location><page_9><loc_11><loc_9><loc_89><loc_50></location>
 <caption>Table 2-2 Comparison of the different function usage IDs and *JOBCTL authority</caption>
-<row_0><col_0><row_header>User action</col_0><col_1><body>*JOBCTL</col_1><col_2><body>QIBM_DB_SECADM</col_2><col_3><body>QIBM_DB_SQLADM</col_3><col_4><body>QIBM_DB_SYSMON</col_4><col_5><body>No Authority</col_5></row_0>
+<row_0><col_0><body>User action</col_0><col_1><col_header>*JOBCTL</col_1><col_2><col_header>QIBM_DB_SECADM</col_2><col_3><col_header>QIBM_DB_SQLADM</col_3><col_4><col_header>QIBM_DB_SYSMON</col_4><col_5><col_header>No Authority</col_5></row_0>
 <row_1><col_0><row_header>SET CURRENT DEGREE  (SQL statement)</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_1>
 <row_2><col_0><row_header>CHGQRYA  command targeting a different user’s job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_2>
 <row_3><col_0><row_header>STRDBMON  or  ENDDBMON  commands targeting a different user’s job</col_0><col_1><body>X</col_1><col_2><body></col_2><col_3><body>X</col_3><col_4><body></col_4><col_5><body></col_5></row_3>
--- a/tests/data/groundtruth/docling_v1/redp5110_sampled.json
+++ b/tests/data/groundtruth/docling_v1/redp5110_sampled.json
--- a/tests/data/groundtruth/docling_v1/redp5110_sampled.pages.json
+++ b/tests/data/groundtruth/docling_v1/redp5110_sampled.pages.json
--- a/tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt
+++ b/tests/data/groundtruth/docling_v2/2203.01017v2.doctags.txt
@ -8,13 +8,13 @@
 <section_header_level_1><loc_41><loc_341><loc_104><loc_348>1. Introduction</section_header_level_1>
 <text><loc_41><loc_354><loc_234><loc_450>The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.</text>
 <picture><loc_258><loc_144><loc_439><loc_191></picture>
-<otsl><loc_258><loc_144><loc_439><loc_191><ched>3<ched>1<nl></otsl>
+<otsl><loc_258><loc_144><loc_439><loc_191><ched>1<nl></otsl>
 <unordered_list><list_item><loc_258><loc_198><loc_397><loc_210>b. Red-annotation of bounding boxes, Blue-predictions by TableFormer</list_item>
 <list_item><loc_258><loc_265><loc_401><loc_271>c. Structure predicted by TableFormer:</list_item>
 </unordered_list>
 <picture><loc_257><loc_213><loc_441><loc_259></picture>
 <picture><loc_258><loc_274><loc_439><loc_313><caption><loc_252><loc_325><loc_445><loc_353>Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.</caption></picture>
-<otsl><loc_258><loc_274><loc_439><loc_313><ched>0<ched>1<lcel><ched>2 1<lcel><ecel><nl><fcel>3<fcel>4<fcel>5 3<fcel>6<fcel>7<ecel><nl><fcel>8<fcel>9<fcel>10<fcel>11<fcel>12<fcel>2<nl><ecel><fcel>13<fcel>14<fcel>15<fcel>16<ucel><nl><ecel><fcel>17<fcel>18<fcel>19<fcel>20<ucel><nl></otsl>
+<otsl><loc_258><loc_274><loc_439><loc_313><fcel>0<fcel>1 2 1<lcel><lcel><lcel><nl><fcel>3<fcel>4 3<fcel>5<fcel>6<fcel>7<nl><fcel>8 2<fcel>9<fcel>10<fcel>11<fcel>12<nl><fcel>13<ecel><fcel>14<fcel>15<fcel>16<nl><fcel>17<fcel>18<ecel><fcel>19<fcel>20<nl></otsl>
 <text><loc_252><loc_369><loc_445><loc_420>Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.</text>
 <text><loc_252><loc_422><loc_445><loc_450>The first problem is called table-location and has been previously addressed [30, 38, 19, 21, 23, 26, 8] with stateof-the-art object-detection networks (e.g. YOLO and later on Mask-RCNN [9]). For all practical purposes, it can be</text>
 <page_footer><loc_241><loc_463><loc_245><loc_469>1</page_footer>
@ -102,7 +102,7 @@
 <text><loc_41><loc_374><loc_234><loc_387>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</text>
 <text><loc_41><loc_389><loc_214><loc_395>FT: Model was trained on PubTabNet then finetuned.</text>
 <text><loc_41><loc_407><loc_234><loc_450><loc_41><loc_407><loc_234><loc_450>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
-<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><fcel>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><fcel>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><fcel>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
+<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><rhed>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><rhed>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><rhed>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
 <text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
 <otsl><loc_272><loc_341><loc_426><loc_406><fcel>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
 <page_footer><loc_241><loc_463><loc_245><loc_469>7</page_footer>
@ -114,7 +114,7 @@
 <section_header_level_1><loc_249><loc_60><loc_352><loc_64>Example table from FinTabNet:</section_header_level_1>
 <picture><loc_41><loc_65><loc_246><loc_118></picture>
 <picture><loc_250><loc_62><loc_453><loc_114><caption><loc_44><loc_131><loc_315><loc_136>b. Structure predicted by TableFormer, with superimposed matched PDF cell text:</caption></picture>
-<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><ecel><fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
+<otsl><loc_44><loc_138><loc_244><loc_185><ecel><ecel><ched>論文ファイル<lcel><ched>参考文献<lcel><nl><ched>出典<ched>ファイル 数<ched>英語<ched>日本語<ched>英語<ched>日本語<nl><rhed>Association for Computational Linguistics(ACL2003)<fcel>65<fcel>65<fcel>0<fcel>150<fcel>0<nl><rhed>Computational Linguistics(COLING2002)<fcel>140<fcel>140<fcel>0<fcel>150<fcel>0<nl><rhed>電気情報通信学会 2003 年総合大会<fcel>150<fcel>8<fcel>142<fcel>223<fcel>147<nl><rhed>情報処理学会第 65 回全国大会 (2003)<fcel>177<fcel>1<fcel>176<fcel>150<fcel>236<nl><rhed>第 17 回人工知能学会全国大会 (2003)<fcel>208<fcel>5<fcel>203<fcel>152<fcel>244<nl><rhed>自然言語処理研究会第 146 〜 155 回<fcel>98<fcel>2<fcel>96<fcel>150<fcel>232<nl><rhed>WWW から収集した論文<fcel>107<fcel>73<fcel>34<fcel>147<fcel>96<nl><rhed>計<fcel>945<fcel>294<fcel>651<fcel>1122<fcel>955<nl><caption><loc_311><loc_185><loc_449><loc_189>Text is aligned to match original for ease of viewing</caption></otsl>
 <otsl><loc_249><loc_138><loc_450><loc_182><ecel><ched>Shares (in millions)<lcel><ched>Weighted Average Grant Date Fair Value<lcel><nl><ecel><ched>RS U s<ched>PSUs<ched>RSUs<ched>PSUs<nl><rhed>Nonvested on Janua ry 1<fcel>1. 1<fcel>0.3<fcel>90.10 $<fcel>$ 91.19<nl><rhed>Granted<fcel>0. 5<fcel>0.1<fcel>117.44<fcel>122.41<nl><rhed>Vested<fcel>(0. 5 )<fcel>(0.1)<fcel>87.08<fcel>81.14<nl><rhed>Canceled or forfeited<fcel>(0. 1 )<fcel>-<fcel>102.01<fcel>92.18<nl><rhed>Nonvested on December 31<fcel>1.0<fcel>0.3<fcel>104.85 $<fcel>$ 104.51<nl></otsl>
 <picture><loc_42><loc_240><loc_173><loc_280><caption><loc_51><loc_290><loc_435><loc_295>Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.</caption></picture>
 <picture><loc_177><loc_240><loc_307><loc_280><caption><loc_41><loc_203><loc_445><loc_231>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption></picture>
--- a/tests/data/groundtruth/docling_v2/2203.01017v2.json
+++ b/tests/data/groundtruth/docling_v2/2203.01017v2.json
--- a/tests/data/groundtruth/docling_v2/2203.01017v2.md
+++ b/tests/data/groundtruth/docling_v2/2203.01017v2.md
@ -25,12 +25,12 @@ Figure 1: Picture of a table with subtle, complex features such as (1) multi-col

 <!-- image -->

-| 0   |   1 | 1   |   2 1 |   2 1 |    |
-|-----|-----|-----|-------|-------|----|
-| 3   |   4 | 5 3 |     6 |     7 |    |
-| 8   |   9 | 10  |    11 |    12 | 2  |
-|     |  13 | 14  |    15 |    16 | 2  |
-|     |  17 | 18  |    19 |    20 | 2  |
+| 0   | 1 2 1   | 1 2 1   |   1 2 1 |   1 2 1 |
+|-----|---------|---------|---------|---------|
+| 3   | 4 3     | 5       |       6 |       7 |
+| 8 2 | 9       | 10      |      11 |      12 |
+| 13  |         | 14      |      15 |      16 |
+| 17  | 18      |         |      19 |      20 |

 Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.

@ -247,7 +247,7 @@ Text is aligned to match original for ease of viewing
 | 第 17 回人工知能学会全国大会 (2003)                | 208         | 5              | 203            | 152        | 244        |
 | 自然言語処理研究会第 146 〜 155 回                 | 98          | 2              | 96             | 150        | 232        |
 | WWW から収集した論文                               | 107         | 73             | 34             | 147        | 96         |
-|                                                    | 945         | 294            | 651            | 1122       | 955        |
+| 計                                                 | 945         | 294            | 651            | 1122       | 955        |

 |                          | Shares (in millions)   | Shares (in millions)   | Weighted Average Grant Date Fair Value   | Weighted Average Grant Date Fair Value   |
 |--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
--- a/tests/data/groundtruth/docling_v2/2203.01017v2.pages.json
+++ b/tests/data/groundtruth/docling_v2/2203.01017v2.pages.json
--- a/tests/data/groundtruth/docling_v2/2206.01062.doctags.txt
+++ b/tests/data/groundtruth/docling_v2/2206.01062.doctags.txt
@ -58,7 +58,7 @@
 <text><loc_260><loc_399><loc_457><loc_446>The annotation campaign was carried out in four phases. In phase one, we identified and prepared the data sources for annotation. In phase two, we determined the class labels and how annotations should be done on the documents in order to obtain maximum consistency. The latter was guided by a detailed requirement analysis and exhaustive experiments. In phase three, we trained the annotation staff and performed exams for quality assurance. In phase four,</text>
 <page_break>
 <page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
-<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
+<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
 <picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
 <text><loc_44><loc_400><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
 <text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
@ -88,7 +88,7 @@
 <page_break>
 <page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
 <text><loc_44><loc_55><loc_242><loc_116>Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on DocLayNet test set. The MRCNN (Mask R-CNN) and FRCNN (Faster R-CNN) models with ResNet-50 or ResNet-101 backbone were trained based on the network architectures from the detectron2 model zoo (Mask R-CNN R50, R101-FPN 3x, Faster R-CNN R101-FPN 3x), with default configurations. The YOLO implementation utilized was YOLOv5x6 [13]. All models were initialised using pre-trained weights from the COCO 2017 dataset.</text>
-<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ucel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
+<otsl><loc_51><loc_124><loc_233><loc_222><ecel><ched>human<ched>MRCNN<lcel><ched>FRCNN<ched>YOLO<nl><ecel><ecel><ched>R50<ched>R101<ched>R101<ched>v5x6<nl><rhed>Caption<fcel>84-89<fcel>68.4<fcel>71.5<fcel>70.1<fcel>77.7<nl><rhed>Footnote<fcel>83-91<fcel>70.9<fcel>71.8<fcel>73.7<fcel>77.2<nl><rhed>Formula<fcel>83-85<fcel>60.1<fcel>63.4<fcel>63.5<fcel>66.2<nl><rhed>List-item<fcel>87-88<fcel>81.2<fcel>80.8<fcel>81.0<fcel>86.2<nl><rhed>Page-footer<fcel>93-94<fcel>61.6<fcel>59.3<fcel>58.9<fcel>61.1<nl><rhed>Page-header<fcel>85-89<fcel>71.9<fcel>70.0<fcel>72.0<fcel>67.9<nl><rhed>Picture<fcel>69-71<fcel>71.7<fcel>72.7<fcel>72.0<fcel>77.1<nl><rhed>Section-header<fcel>83-84<fcel>67.6<fcel>69.3<fcel>68.4<fcel>74.6<nl><rhed>Table<fcel>77-81<fcel>82.2<fcel>82.9<fcel>82.2<fcel>86.3<nl><rhed>Text<fcel>84-86<fcel>84.6<fcel>85.8<fcel>85.4<fcel>88.1<nl><rhed>Title<fcel>60-72<fcel>76.7<fcel>80.4<fcel>79.9<fcel>82.7<nl><rhed>All<fcel>82-83<fcel>72.4<fcel>73.5<fcel>73.4<fcel>76.8<nl></otsl>
 <text><loc_44><loc_234><loc_241><loc_364>to avoid this at any cost in order to have clear, unbiased baseline numbers for human document-layout annotation. Third, we introduced the feature of snapping boxes around text segments to obtain a pixel-accurate annotation and again reduce time and effort. The CCS annotation tool automatically shrinks every user-drawn box to the minimum bounding-box around the enclosed text-cells for all purely text-based segments, which excludes only Table and Picture . For the latter, we instructed annotation staff to minimise inclusion of surrounding whitespace while including all graphical lines. A downside of snapping boxes to enclosed text cells is that some wrongly parsed PDF pages cannot be annotated correctly and need to be skipped. Fourth, we established a way to flag pages as rejected for cases where no valid annotation according to the label guidelines could be achieved. Example cases for this would be PDF pages that render incorrectly or contain layouts that are impossible to capture with non-overlapping rectangles. Such rejected pages are not contained in the final dataset. With all these measures in place, experienced annotation staff managed to annotate a single page in a typical timeframe of 20s to 60s, depending on its complexity.</text>
 <section_header_level_1><loc_44><loc_371><loc_120><loc_378>5 EXPERIMENTS</section_header_level_1>
 <text><loc_44><loc_387><loc_241><loc_448>The primary goal of DocLayNet is to obtain high-quality ML models capable of accurate document-layout analysis on a wide variety of challenging layouts. As discussed in Section 2, object detection models are currently the easiest to use, due to the standardisation of ground-truth data in COCO format [16] and the availability of general frameworks such as detectron2 [17]. Furthermore, baseline numbers in PubLayNet and DocBank were obtained using standard object detection models such as Mask R-CNN and Faster R-CNN. As such, we will relate to these object detection methods in this</text>
@ -101,7 +101,7 @@
 <page_header><loc_44><loc_38><loc_284><loc_43>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</page_header>
 <page_header><loc_299><loc_38><loc_456><loc_43>KDD ’22, August 14-18, 2022, Washington, DC, USA</page_header>
 <text><loc_44><loc_55><loc_242><loc_81>Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.</text>
-<otsl><loc_66><loc_95><loc_218><loc_187><ched>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
+<otsl><loc_66><loc_95><loc_218><loc_187><fcel>Class-count<ched>11<ched>6<ched>5<ched>4<nl><rhed>Caption<fcel>68<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Footnote<fcel>71<fcel>Text<fcel>Text<fcel>Text<nl><rhed>Formula<fcel>60<fcel>Text<fcel>Text<fcel>Text<nl><rhed>List-item<fcel>81<fcel>Text<fcel>82<fcel>Text<nl><rhed>Page-footer<fcel>62<fcel>62<fcel>-<fcel>-<nl><rhed>Page-header<fcel>72<fcel>68<fcel>-<fcel>-<nl><rhed>Picture<fcel>72<fcel>72<fcel>72<fcel>72<nl><rhed>Section-header<fcel>68<fcel>67<fcel>69<fcel>68<nl><rhed>Table<fcel>82<fcel>83<fcel>82<fcel>82<nl><rhed>Text<fcel>85<fcel>84<fcel>84<fcel>84<nl><rhed>Title<fcel>77<fcel>Sec.-h.<fcel>Sec.-h.<fcel>Sec.-h.<nl><rhed>Overall<fcel>72<fcel>73<fcel>78<fcel>77<nl></otsl>
 <section_header_level_1><loc_44><loc_202><loc_107><loc_208>Learning Curve</section_header_level_1>
 <text><loc_43><loc_211><loc_241><loc_334>One of the fundamental questions related to any dataset is if it is "large enough". To answer this question for DocLayNet, we performed a data ablation study in which we evaluated a Mask R-CNN model trained on increasing fractions of the DocLayNet dataset. As can be seen in Figure 5, the mAP score rises sharply in the beginning and eventually levels out. To estimate the error-bar on the metrics, we ran the training five times on the entire data-set. This resulted in a 1% error-bar, depicted by the shaded area in Figure 5. In the inset of Figure 5, we show the exact same data-points, but with a logarithmic scale on the x-axis. As is expected, the mAP score increases linearly as a function of the data-size in the inset. The curve ultimately flattens out between the 80% and 100% mark, with the 80% mark falling within the error-bars of the 100% mark. This provides a good indication that the model would not improve significantly by yet increasing the data size. Rather, it would probably benefit more from improved data consistency (as discussed in Section 3), data augmentation methods [23], or the addition of more document categories and styles.</text>
 <section_header_level_1><loc_44><loc_342><loc_134><loc_349>Impact of Class Labels</section_header_level_1>
@ -116,7 +116,7 @@
 <page_break>
 <page_header><loc_44><loc_38><loc_456><loc_43>KDD '22, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar</page_header>
 <text><loc_44><loc_55><loc_242><loc_95>Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.</text>
-<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ucel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ucel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ucel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ucel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ucel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ucel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
+<otsl><loc_59><loc_109><loc_225><loc_215><ecel><ecel><ched>Testing on<lcel><lcel><nl><ched>Training on<ched>labels<ched>PLN<ched>DB<ched>DLN<nl><rhed>PubLayNet (PLN)<rhed>Figure<fcel>96<fcel>43<fcel>23<nl><ucel><rhed>Sec-header<fcel>87<fcel>-<fcel>32<nl><ecel><rhed>Table<fcel>95<fcel>24<fcel>49<nl><ecel><rhed>Text<fcel>96<fcel>-<fcel>42<nl><ecel><rhed>total<fcel>93<fcel>34<fcel>30<nl><rhed>DocBank (DB)<rhed>Figure<fcel>77<fcel>71<fcel>31<nl><ucel><rhed>Table<fcel>19<fcel>65<fcel>22<nl><ucel><rhed>total<fcel>48<fcel>68<fcel>27<nl><rhed>DocLayNet (DLN)<rhed>Figure<fcel>67<fcel>51<fcel>72<nl><ucel><rhed>Sec-header<fcel>53<fcel>-<fcel>68<nl><ecel><rhed>Table<fcel>87<fcel>43<fcel>82<nl><ecel><rhed>Text<fcel>77<fcel>-<fcel>84<nl><ecel><rhed>total<fcel>59<fcel>47<fcel>78<nl></otsl>
 <text><loc_44><loc_247><loc_240><loc_280>Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .</text>
 <text><loc_44><loc_281><loc_241><loc_370>For comparison of DocBank with DocLayNet, we trained only on Picture and Table clusters of each dataset. We had to exclude Text because successive paragraphs are often grouped together into a single object in DocBank. This paragraph grouping is incompatible with the individual paragraphs of DocLayNet. As can be seen in Table 5, DocLayNet trained models yield better performance compared to the previous datasets. It is noteworthy that the models trained on PubLayNet and DocBank perform very well on their own test set, but have a much lower performance on the foreign datasets. While this also applies to DocLayNet, the difference is far less pronounced. Thus we conclude that DocLayNet trained models are overall more robust and will produce better results for challenging, unseen layouts.</text>
 <section_header_level_1><loc_44><loc_382><loc_127><loc_388>Example Predictions</section_header_level_1>
--- a/tests/data/groundtruth/docling_v2/2206.01062.json
+++ b/tests/data/groundtruth/docling_v2/2206.01062.json
--- a/tests/data/groundtruth/docling_v2/2206.01062.md
+++ b/tests/data/groundtruth/docling_v2/2206.01062.md
@ -97,21 +97,21 @@ The annotation campaign was carried out in four phases. In phase one, we identif

 Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.

-|                |         | % of Total   | % of Total   | % of Total   | % of Total   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   |
-|----------------|---------|--------------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
-| class label    | Count   | Train        | Test         | Val          | All          | Fin                                         | Man                                         | Sci                                         | Law                                         | Pat                                         | Ten                                         |
-| Caption        | 22524   | 2.04         | 1.77         | 2.32         | 84-89        | 40-61                                       | 86-92                                       | 94-99                                       | 95-99                                       | 69-78                                       | n/a                                         |
-| Footnote       | 6318    | 0.60         | 0.31         | 0.58         | 83-91        | n/a                                         | 100                                         | 62-88                                       | 85-94                                       | n/a                                         | 82-97                                       |
-| Formula        | 25027   | 2.25         | 1.90         | 2.96         | 83-85        | n/a                                         | n/a                                         | 84-87                                       | 86-96                                       | n/a                                         | n/a                                         |
-| List-item      | 185660  | 17.19        | 13.34        | 15.82        | 87-88        | 74-83                                       | 90-92                                       | 97-97                                       | 81-85                                       | 75-88                                       | 93-95                                       |
-| Page-footer    | 70878   | 6.51         | 5.58         | 6.00         | 93-94        | 88-90                                       | 95-96                                       | 100                                         | 92-97                                       | 100                                         | 96-98                                       |
-| Page-header    | 58022   | 5.10         | 6.70         | 5.06         | 85-89        | 66-76                                       | 90-94                                       | 98-100                                      | 91-92                                       | 97-99                                       | 81-86                                       |
-| Picture        | 45976   | 4.21         | 2.78         | 5.31         | 69-71        | 56-59                                       | 82-86                                       | 69-82                                       | 80-95                                       | 66-71                                       | 59-76                                       |
-| Section-header | 142884  | 12.60        | 15.77        | 12.85        | 83-84        | 76-81                                       | 90-92                                       | 94-95                                       | 87-94                                       | 69-73                                       | 78-86                                       |
-| Table          | 34733   | 3.20         | 2.27         | 3.60         | 77-81        | 75-80                                       | 83-86                                       | 98-99                                       | 58-80                                       | 79-84                                       | 70-85                                       |
-| Text           | 510377  | 45.82        | 49.28        | 45.00        | 84-86        | 81-86                                       | 88-93                                       | 89-93                                       | 87-92                                       | 71-79                                       | 87-95                                       |
-| Title          | 5071    | 0.47         | 0.30         | 0.50         | 60-72        | 24-63                                       | 50-63                                       | 94-100                                      | 82-96                                       | 68-79                                       | 24-56                                       |
-| Total          | 1107470 | 941123       | 99816        | 66531        | 82-83        | 71-74                                       | 79-81                                       | 89-94                                       | 86-91                                       | 71-76                                       | 68-85                                       |
+|                |         | % of Total   | % of Total   | % of Total   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   | triple inter-annotator mAP @ 0.5-0.95 (%)   |
+|----------------|---------|--------------|--------------|--------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|
+| class label    | Count   | Train        | Test         | Val          | All                                         | Fin                                         | Man                                         | Sci                                         | Law                                         | Pat                                         | Ten                                         |
+| Caption        | 22524   | 2.04         | 1.77         | 2.32         | 84-89                                       | 40-61                                       | 86-92                                       | 94-99                                       | 95-99                                       | 69-78                                       | n/a                                         |
+| Footnote       | 6318    | 0.60         | 0.31         | 0.58         | 83-91                                       | n/a                                         | 100                                         | 62-88                                       | 85-94                                       | n/a                                         | 82-97                                       |
+| Formula        | 25027   | 2.25         | 1.90         | 2.96         | 83-85                                       | n/a                                         | n/a                                         | 84-87                                       | 86-96                                       | n/a                                         | n/a                                         |
+| List-item      | 185660  | 17.19        | 13.34        | 15.82        | 87-88                                       | 74-83                                       | 90-92                                       | 97-97                                       | 81-85                                       | 75-88                                       | 93-95                                       |
+| Page-footer    | 70878   | 6.51         | 5.58         | 6.00         | 93-94                                       | 88-90                                       | 95-96                                       | 100                                         | 92-97                                       | 100                                         | 96-98                                       |
+| Page-header    | 58022   | 5.10         | 6.70         | 5.06         | 85-89                                       | 66-76                                       | 90-94                                       | 98-100                                      | 91-92                                       | 97-99                                       | 81-86                                       |
+| Picture        | 45976   | 4.21         | 2.78         | 5.31         | 69-71                                       | 56-59                                       | 82-86                                       | 69-82                                       | 80-95                                       | 66-71                                       | 59-76                                       |
+| Section-header | 142884  | 12.60        | 15.77        | 12.85        | 83-84                                       | 76-81                                       | 90-92                                       | 94-95                                       | 87-94                                       | 69-73                                       | 78-86                                       |
+| Table          | 34733   | 3.20         | 2.27         | 3.60         | 77-81                                       | 75-80                                       | 83-86                                       | 98-99                                       | 58-80                                       | 79-84                                       | 70-85                                       |
+| Text           | 510377  | 45.82        | 49.28        | 45.00        | 84-86                                       | 81-86                                       | 88-93                                       | 89-93                                       | 87-92                                       | 71-79                                       | 87-95                                       |
+| Title          | 5071    | 0.47         | 0.30         | 0.50         | 60-72                                       | 24-63                                       | 50-63                                       | 94-100                                      | 82-96                                       | 68-79                                       | 24-56                                       |
+| Total          | 1107470 | 941123       | 99816        | 66531        | 82-83                                       | 71-74                                       | 79-81                                       | 89-94                                       | 86-91                                       | 71-76                                       | 68-85                                       |

 Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.

@ -154,7 +154,7 @@ Table 2: Prediction performance (mAP@0.5-0.95) of object detection networks on D

 |                | human   | MRCNN   | MRCNN   | FRCNN   | YOLO   |
 |----------------|---------|---------|---------|---------|--------|
-|                | human   | R50     | R101    | R101    | v5x6   |
+|                |         | R50     | R101    | R101    | v5x6   |
 | Caption        | 84-89   | 68.4    | 71.5    | 70.1    | 77.7   |
 | Footnote       | 83-91   | 70.9    | 71.8    | 73.7    | 77.2   |
 | Formula        | 83-85   | 60.1    | 63.4    | 63.5    | 66.2   |
@ -246,17 +246,17 @@ Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network acros
 | Training on     | labels     | PLN          | DB           | DLN          |
 | PubLayNet (PLN) | Figure     | 96           | 43           | 23           |
 | PubLayNet (PLN) | Sec-header | 87           | -            | 32           |
-| PubLayNet (PLN) | Table      | 95           | 24           | 49           |
-| PubLayNet (PLN) | Text       | 96           | -            | 42           |
-| PubLayNet (PLN) | total      | 93           | 34           | 30           |
+|                 | Table      | 95           | 24           | 49           |
+|                 | Text       | 96           | -            | 42           |
+|                 | total      | 93           | 34           | 30           |
 | DocBank (DB)    | Figure     | 77           | 71           | 31           |
 | DocBank (DB)    | Table      | 19           | 65           | 22           |
 | DocBank (DB)    | total      | 48           | 68           | 27           |
 | DocLayNet (DLN) | Figure     | 67           | 51           | 72           |
 | DocLayNet (DLN) | Sec-header | 53           | -            | 68           |
-| DocLayNet (DLN) | Table      | 87           | 43           | 82           |
-| DocLayNet (DLN) | Text       | 77           | -            | 84           |
-| DocLayNet (DLN) | total      | 59           | 47           | 78           |
+|                 | Table      | 87           | 43           | 82           |
+|                 | Text       | 77           | -            | 84           |
+|                 | total      | 59           | 47           | 78           |

 Section-header , Table and Text . Before training, we either mapped or excluded DocLayNet's other labels as specified in table 3, and also PubLayNet's List to Text . Note that the different clustering of lists (by list-element vs. whole list objects) naturally decreases the mAP score for Text .

--- a/tests/data/groundtruth/docling_v2/2206.01062.pages.json
+++ b/tests/data/groundtruth/docling_v2/2206.01062.pages.json
--- a/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.doctags.txt
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.doctags.txt
@ -3,7 +3,7 @@
 <text><loc_110><loc_74><loc_393><loc_97>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
 <section_header_level_1><loc_110><loc_105><loc_260><loc_113>5.1 Hyper Parameter Optimization</section_header_level_1>
 <text><loc_110><loc_116><loc_393><loc_161>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
-<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><ecel><ecel><fcel>OTSL<fcel>0.952 0.923<fcel>0.909<fcel>0.938<fcel>0.843<fcel>3.77<nl><fcel>2<fcel>4<fcel>HTML<fcel>0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
+<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_172><loc_393><loc_207>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
 <section_header_level_1><loc_110><loc_319><loc_216><loc_327>5.2 Quantitative Results</section_header_level_1>
 <text><loc_110><loc_330><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
 <text><loc_110><loc_390><loc_393><loc_421>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
--- a/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json
--- a/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.md
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.md
@ -6,14 +6,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly

 Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.

-| #          | #          | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
-|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
-| enc-layers | dec-layers | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
-| 6          | 6          | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
-| 4          | 4          | OTSL HTML  | 0.938       | 0.904       | 0.927       | 0.853       | 1.97        |
-|            |            | OTSL       | 0.952 0.923 | 0.909       | 0.938       | 0.843       | 3.77        |
-| 2          | 4          | HTML       | 0.945       | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
-| 4          | 2          | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |
+| # enc-layers   | # dec-layers   | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
+|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
+| # enc-layers   | # dec-layers   | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
+| 6              | 6              | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
+| 4              | 4              | OTSL HTML  | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77   |
+| 2              | 4              | OTSL HTML  | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
+| 4              | 2              | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |

 ## 5.2 Quantitative Results

--- a/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.pages.json
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1-pg9.pages.json
--- a/tests/data/groundtruth/docling_v2/2305.03393v1.doctags.txt
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1.doctags.txt
@ -89,14 +89,14 @@
 <text><loc_110><loc_75><loc_393><loc_96>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
 <section_header_level_1><loc_110><loc_107><loc_260><loc_112>5.1 Hyper Parameter Optimization</section_header_level_1>
 <text><loc_110><loc_117><loc_393><loc_160>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
-<otsl><loc_114><loc_213><loc_388><loc_296><ched>#<ched>#<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ched>enc-layers<ched>dec-layers<ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><fcel>2<fcel>4<fcel>OTSL<fcel>0.923 0.945<fcel>0.909 0.897<fcel>0.938<fcel>0.843<fcel>3.77<nl><ecel><ecel><fcel>HTML<ecel><fcel>0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
+<otsl><loc_114><loc_213><loc_388><loc_296><ched># enc-layers<ched># dec-layers<ched>Language<ched>TEDs<lcel><lcel><ched>mAP<ched>Inference<nl><ucel><ucel><ucel><ched>simple<ched>complex<ched>all<ched>(0.75)<ched>time (secs)<nl><fcel>6<fcel>6<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>4<fcel>4<fcel>OTSL HTML<fcel>0.938 0.952<fcel>0.904 0.909<fcel>0.927 0.938<fcel>0.853 0.843<fcel>1.97 3.77<nl><fcel>2<fcel>4<fcel>OTSL HTML<fcel>0.923 0.945<fcel>0.897 0.901<fcel>0.915 0.931<fcel>0.859 0.834<fcel>1.91 3.81<nl><fcel>4<fcel>2<fcel>OTSL HTML<fcel>0.952 0.944<fcel>0.92 0.903<fcel>0.942 0.931<fcel>0.857 0.824<fcel>1.22 2<nl><caption><loc_110><loc_174><loc_393><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
 <section_header_level_1><loc_110><loc_321><loc_216><loc_326>5.2 Quantitative Results</section_header_level_1>
 <text><loc_110><loc_331><loc_393><loc_390>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
 <text><loc_110><loc_392><loc_393><loc_420>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
 <page_break>
 <page_header><loc_110><loc_59><loc_118><loc_64>10</page_header>
 <page_header><loc_137><loc_59><loc_189><loc_64>M. Lysak, et al.</page_header>
-<otsl><loc_117><loc_99><loc_385><loc_166><ecel><ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ecel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><rhed>PubTabNet<rhed>OTSL<fcel>0.965<fcel>0.934<fcel>0.955<fcel>0.88<fcel>2.73<nl><ucel><rhed>HTML<fcel>0.969<fcel>0.927<fcel>0.955<fcel>0.857<fcel>5.39<nl><rhed>FinTabNet<rhed>OTSL<fcel>0.955<fcel>0.961<fcel>0.959<fcel>0.862<fcel>1.85<nl><ucel><rhed>HTML<fcel>0.917<fcel>0.922<fcel>0.92<fcel>0.722<fcel>3.26<nl><rhed>PubTables-1M<rhed>OTSL<fcel>0.987<fcel>0.964<fcel>0.977<fcel>0.896<fcel>1.79<nl><ucel><rhed>HTML<fcel>0.983<fcel>0.944<fcel>0.966<fcel>0.889<fcel>3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
+<otsl><loc_117><loc_99><loc_385><loc_166><ched>Data set<ched>Language<ched>TEDs<lcel><lcel><ched>mAP(0.75)<ched>Inference time (secs)<nl><ucel><ucel><ched>simple<ched>complex<ched>all<ucel><ucel><nl><fcel>PubTabNet<fcel>OTSL HTML<fcel>0.965 0.969<fcel>0.934 0.927<fcel>0.955 0.955<fcel>0.88 0.857<fcel>2.73 5.39<nl><fcel>FinTabNet<fcel>OTSL HTML<fcel>0.955 0.917<fcel>0.961 0.922<fcel>0.959 0.92<fcel>0.862 0.722<fcel>1.85 3.26<nl><fcel>PubTables-1M<fcel>OTSL HTML<fcel>0.987 0.983<fcel>0.964 0.944<fcel>0.977 0.966<fcel>0.896 0.889<fcel>1.79 3.26<nl><caption><loc_110><loc_73><loc_393><loc_92>Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).</caption></otsl>
 <section_header_level_1><loc_110><loc_182><loc_210><loc_188>5.3 Qualitative Results</section_header_level_1>
 <text><loc_110><loc_196><loc_393><loc_231>To illustrate the qualitative differences between OTSL and HTML, Figure 5 demonstrates less overlap and more accurate bounding boxes with OTSL. In Figure 6, OTSL proves to be more effective in handling tables with longer token sequences, resulting in even more precise structure prediction and bounding boxes.</text>
 <picture><loc_133><loc_281><loc_369><loc_419><caption><loc_110><loc_251><loc_393><loc_278>Fig. 5. The OTSL model produces more accurate bounding boxes with less overlap (E) than the HTML model (D), when predicting the structure of a sparse table (A), at twice the inference speed because of shorter sequence length (B),(C). "PMC2807444_006_00.png" PubTabNet. μ</caption></picture>
--- a/tests/data/groundtruth/docling_v2/2305.03393v1.json
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1.json
--- a/tests/data/groundtruth/docling_v2/2305.03393v1.md
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1.md
@ -126,14 +126,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly

 Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.

-| #          | #          | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
-|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
-| enc-layers | dec-layers | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
-| 6          | 6          | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
-| 4          | 4          | OTSL HTML  | 0.938 0.952 | 0.904       | 0.927       | 0.853       | 1.97        |
-| 2          | 4          | OTSL       | 0.923 0.945 | 0.909 0.897 | 0.938       | 0.843       | 3.77        |
-|            |            | HTML       |             | 0.901       | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
-| 4          | 2          | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |
+| # enc-layers   | # dec-layers   | Language   | TEDs        | TEDs        | TEDs        | mAP         | Inference   |
+|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
+| # enc-layers   | # dec-layers   | Language   | simple      | complex     | all         | (0.75)      | time (secs) |
+| 6              | 6              | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39   |
+| 4              | 4              | OTSL HTML  | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77   |
+| 2              | 4              | OTSL HTML  | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81   |
+| 4              | 2              | OTSL HTML  | 0.952 0.944 | 0.92 0.903  | 0.942 0.931 | 0.857 0.824 | 1.22 2      |

 ## 5.2 Quantitative Results

@ -143,15 +142,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied

 Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).

-|              | Language   | TEDs   | TEDs    | TEDs   | mAP(0.75)   | Inference time (secs)   |
-|--------------|------------|--------|---------|--------|-------------|-------------------------|
-|              | Language   | simple | complex | all    | mAP(0.75)   | Inference time (secs)   |
-| PubTabNet    | OTSL       | 0.965  | 0.934   | 0.955  | 0.88        | 2.73                    |
-| PubTabNet    | HTML       | 0.969  | 0.927   | 0.955  | 0.857       | 5.39                    |
-| FinTabNet    | OTSL       | 0.955  | 0.961   | 0.959  | 0.862       | 1.85                    |
-| FinTabNet    | HTML       | 0.917  | 0.922   | 0.92   | 0.722       | 3.26                    |
-| PubTables-1M | OTSL       | 0.987  | 0.964   | 0.977  | 0.896       | 1.79                    |
-| PubTables-1M | HTML       | 0.983  | 0.944   | 0.966  | 0.889       | 3.26                    |
+| Data set     | Language   | TEDs        | TEDs        | TEDs        | mAP(0.75)   | Inference time (secs)   |
+|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
+| Data set     | Language   | simple      | complex     | all         | mAP(0.75)   | Inference time (secs)   |
+| PubTabNet    | OTSL HTML  | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857  | 2.73 5.39               |
+| FinTabNet    | OTSL HTML  | 0.955 0.917 | 0.961 0.922 | 0.959 0.92  | 0.862 0.722 | 1.85 3.26               |
+| PubTables-1M | OTSL HTML  | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26               |

 ## 5.3 Qualitative Results

--- a/tests/data/groundtruth/docling_v2/2305.03393v1.pages.json
+++ b/tests/data/groundtruth/docling_v2/2305.03393v1.pages.json
--- a/tests/data/groundtruth/docling_v2/redp5110_sampled.doctags.txt
+++ b/tests/data/groundtruth/docling_v2/redp5110_sampled.doctags.txt
--- a/tests/data/groundtruth/docling_v2/redp5110_sampled.json
+++ b/tests/data/groundtruth/docling_v2/redp5110_sampled.json
--- a/tests/data/groundtruth/docling_v2/redp5110_sampled.md
+++ b/tests/data/groundtruth/docling_v2/redp5110_sampled.md
@ -10,50 +10,49 @@ Front cover

 ## Contents

-| Notices                                                                                                                                                                   | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii   |
-|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
-| Trademarks                                                                                                                                                                | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii          |
-| DB2 for i Center of Excellence                                                                                                                                            | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix                                          |
-| Preface                                                                                                                                                                   | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi    |
-| Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi                            |                                                                                                                                         |
-| Now you can become a published author, too!                                                                                                                               | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii                                                                |
-| Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                             | xiii                                                                                                                                    |
-| Stay connected to IBM Redbooks                                                                                                                                            | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv                                             |
-| Chapter 1. Securing and protecting IBM DB2 data  . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                | 1                                                                                                                                       |
-| 1.1 Security fundamentals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2                                          |                                                                                                                                         |
-| 1.2 Current state of IBM i security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                             | 2                                                                                                                                       |
-| 1.3 DB2 for i security controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3                                         |                                                                                                                                         |
-| 1.3.1 Existing row and column control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                       | 4                                                                                                                                       |
-| 1.3.2 New controls: Row and Column Access Control. . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                    | 5                                                                                                                                       |
-| Chapter 2. Roles and separation of duties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                         | 7                                                                                                                                       |
-| 2.1 Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                 | 8                                                                                                                                       |
-| 2.1.1 DDM and DRDA application server access: QIBM\_DB\_DDMDRDA . . . . . . . . . . .                                                                                       | 8                                                                                                                                       |
-| 2.1.2 Toolbox application server access: QIBM\_DB\_ZDA. . . . . . . . . . . . . . . . . . . . . . . .                                                                       | 8                                                                                                                                       |
-| 2.1.3 Database Administrator function: QIBM\_DB\_SQLADM . . . . . . . . . . . . . . . . . . . . .                                                                           | 9                                                                                                                                       |
-| 2.1.4 Database Information function: QIBM\_DB\_SYSMON                                                                                                                       | . . . . . . . . . . . . . . . . . . . . . . 9                                                                                           |
-| 2.1.5 Security Administrator function: QIBM\_DB\_SECADM . . . . . . . . . . . . . . . . . . . . . .                                                                         | 9                                                                                                                                       |
-| 2.1.6 Change Function Usage CL command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                  | 10                                                                                                                                      |
-| 2.1.7 Verifying function usage IDs for RCAC with the FUNCTION\_USAGE view . . . . .                                                                                        | 10                                                                                                                                      |
-| 2.2 Separation of duties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10                                         |                                                                                                                                         |
-| Chapter 3. Row and Column Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                | 13                                                                                                                                      |
-| 3.1 Explanation of RCAC and the concept of access control . . . . . . . . . . . . . . . . . . . . . . .                                                                   | 14                                                                                                                                      |
-| 3.1.1 Row permission and column mask definitions                                                                                                                          | . . . . . . . . . . . . . . . . . . . . . . . . . . . 14                                                                                |
-| 3.1.2 Enabling and activating RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                          | 16                                                                                                                                      |
-| 3.2 Special registers and built-in global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                       | 18                                                                                                                                      |
-| 3.2.1 Special registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                               | 18                                                                                                                                      |
-| 3.2.2 Built-in global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                 | 19                                                                                                                                      |
-| 3.3 VERIFY\_GROUP\_FOR\_USER function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                | 20                                                                                                                                      |
-| 3.4 Establishing and controlling accessibility by using the RCAC rule text . . . . . . . . . . . . .                                                                      | 21                                                                                                                                      |
-| . . . . . . . . . . . . . . . . . . . . . . . .                                                                                                                           | 22                                                                                                                                      |
-| 3.5 SELECT, INSERT, and UPDATE behavior with RCAC 3.6 Human resources example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | 22                                                                                                                                      |
-| 3.6.1 Assigning the QIBM\_DB\_SECADM function ID to the consultants. . . . . . . . . . . .                                                                                  | 23                                                                                                                                      |
-| 3.6.2 Creating group profiles for the users and their roles . . . . . . . . . . . . . . . . . . . . . . .                                                                 | 23                                                                                                                                      |
-| 3.6.3 Demonstrating data access without RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                  | 24                                                                                                                                      |
-| 3.6.4 Defining and creating row permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                             | 25                                                                                                                                      |
-| 3.6.5 Defining and creating column masks                                                                                                                                  | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26                                                                  |
-| 3.6.6 Activating RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                 | 28                                                                                                                                      |
-| 3.6.7 Demonstrating data access with RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                                 | 29                                                                                                                                      |
-| 3.6.8 Demonstrating data access with a view and RCAC . . . . . . . . . . . . . . . . . . . . . . .                                                                        | 32                                                                                                                                      |
+| Notices                                                                                                                                                       | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii   |
+|---------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
+| Trademarks                                                                                                                                                    | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii          |
+| DB2 for i Center of Excellence                                                                                                                                | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix                                          |
+| Preface                                                                                                                                                       | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi    |
+| Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi                |                                                                                                                                         |
+| Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                               | xiii                                                                                                                                    |
+| Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                 | xiii                                                                                                                                    |
+| Stay connected to IBM Redbooks                                                                                                                                | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv                                             |
+| Chapter 1. Securing and protecting IBM DB2 data                                                                                                               | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1                                                                             |
+| 1.1 Security fundamentals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2                              |                                                                                                                                         |
+| 1.2 Current state of IBM i security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                 | 2                                                                                                                                       |
+| 1.3 DB2 for i security controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3                             |                                                                                                                                         |
+| 1.3.1 Existing row and column control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                           | 4                                                                                                                                       |
+| 1.3.2 New controls: Row and Column Access Control. . . . . . . . . . . . . . . . . . . . . . . . . . .                                                        | 5                                                                                                                                       |
+| Chapter 2. Roles and separation of duties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                             | 7                                                                                                                                       |
+| 2.1 Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     | 8                                                                                                                                       |
+| 2.1.1 DDM and DRDA application server access: QIBM\_DB\_DDMDRDA . . . . . . . . . . .                                                                           | 8                                                                                                                                       |
+| 2.1.2 Toolbox application server access: QIBM\_DB\_ZDA. . . . . . . . . . . . . . . . . . . . . . . .                                                           | 8                                                                                                                                       |
+| 2.1.3 Database Administrator function: QIBM\_DB\_SQLADM . . . . . . . . . . . . . . . . . . . . .                                                               | 9                                                                                                                                       |
+| 2.1.4 Database Information function: QIBM\_DB\_SYSMON                                                                                                           | . . . . . . . . . . . . . . . . . . . . . . 9                                                                                           |
+| 2.1.5 Security Administrator function: QIBM\_DB\_SECADM . . . . . . . . . . . . . . . . . . . . . .                                                             | 9                                                                                                                                       |
+| 2.1.6 Change Function Usage CL command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                      | 10                                                                                                                                      |
+| 2.1.7 Verifying function usage IDs for RCAC with the FUNCTION\_USAGE view . . . . .                                                                            | 10                                                                                                                                      |
+| 2.2 Separation of duties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10                             |                                                                                                                                         |
+| Chapter 3. Row and Column Access Control                                                                                                                      | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13                                                                    |
+| 3.1 Explanation of RCAC and the concept of access control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | 14                                                                                                                                      |
+| 3.1.1 Row permission and column mask definitions                                                                                                              | 14                                                                                                                                      |
+| 3.1.2 Enabling and activating RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                              | 16                                                                                                                                      |
+| 3.2 Special registers and built-in global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                           | 18                                                                                                                                      |
+| 3.2.1 Special registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                   | 18                                                                                                                                      |
+| 3.2.2 Built-in global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                     | 19                                                                                                                                      |
+| 3.3 VERIFY\_GROUP\_FOR\_USER function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                    | 20                                                                                                                                      |
+| 3.4 Establishing and controlling accessibility by using the RCAC rule text . . . . . . . . . . . . .                                                          | 21                                                                                                                                      |
+| Human resources example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                           |                                                                                                                                         |
+| 3.6                                                                                                                                                           | 22                                                                                                                                      |
+| 3.6.1 Assigning the QIBM\_DB\_SECADM function ID to the consultants. . . . . . . . . . . .                                                                      | 23 23                                                                                                                                   |
+| 3.6.2 Creating group profiles for the users and their roles . . . . . . . . . . . . . . . . . . . . . . .                                                     |                                                                                                                                         |
+| 3.6.3 Demonstrating data access without RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                      | 24                                                                                                                                      |
+| 3.6.4 Defining and creating row permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                 | 25                                                                                                                                      |
+| 3.6.5 Defining and creating column masks                                                                                                                      | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26                                                                  |
+| 3.6.6 Activating RCAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                     | 28                                                                                                                                      |
+| 3.6.8 Demonstrating data access with a view and RCAC . . . . . . . . . . . . . . . . . . . . . . .                                                            | 32                                                                                                                                      |

 DB2 for i Center of Excellence

--- a/tests/data/groundtruth/docling_v2/redp5110_sampled.pages.json
+++ b/tests/data/groundtruth/docling_v2/redp5110_sampled.pages.json