feat: new torch-based docling models (#120)

---------

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
parent 9ebbbc1245
commit 2422f706a1
30 changed files with 1159 additions and 1185 deletions
--- a/tests/data/2203.01017v2.md
+++ b/tests/data/2203.01017v2.md
@@ -1,6 +1,6 @@
 ## TableFormer: Table Structure Understanding with Transformers.

-## Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research
+Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research

 { ahn,nli,mly,taa } @zurich.ibm.com

@@ -27,12 +27,12 @@ b. Red-annotation of bounding boxes, Blue-predictions by TableFormer
 c. Structure predicted by TableFormer:

 Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
-| 0   |   1 | 1   |   2 1 |   2 1 |    |
-|-----|-----|-----|-------|-------|----|
-| 3   |   4 | 5 3 |     6 |     7 |    |
-| 8   |   9 | 10  |    11 |    12 | 2  |
-|     |  13 | 14  |    15 |    16 | 2  |
-|     |  17 | 18  |    19 |    20 | 2  |
+| 0   |   1 | 1   |   2 1 | 2 1   |   2 1 |
+|-----|-----|-----|-------|-------|-------|
+| 3   |   4 | 5 3 |     6 |       |     7 |
+| 8   |   9 | 10  |    11 | 12    |    16 |
+| 2   |  13 | 14  |    15 |       |    16 |
+|     |  17 | 18  |    19 | 20    |    16 |

 Recently, significant progress has been made with vision based approaches to extract tables in documents. For the sake of completeness, the issue of table extraction from documents is typically decomposed into two separate challenges, i.e. (1) finding the location of the table(s) on a document-page and (2) finding the structure of a given table in the document.

@@ -179,7 +179,7 @@ where T$_{a}$ and T$_{b}$ represent tables in tree structure HTML format. EditDi

 Structure. As shown in Tab. 2, TableFormer outperforms all SOTA methods across different datasets by a large margin for predicting the table structure from an image. All the more, our model outperforms pre-trained methods. During the evaluation we do not apply any table filtering. We also provide our baseline results on the SynthTabNet dataset. It has been observed that large tables (e.g. tables that occupy half of the page or more) yield poor predictions. We attribute this issue to the image resizing during the preprocessing step, that produces downsampled images with indistinguishable features. This problem can be addressed by treating such big tables with a separate model which accepts a large input image size.

-Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).
+Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN). FT: Model was trained on PubTabNet then finetuned.
 | Model       | Dataset   | Simple   | TEDS Complex   |   All |
 |-------------|-----------|----------|----------------|-------|
 | EDD         | PTN       | 91.1     | 88.7           | 89.9  |
@@ -193,8 +193,6 @@ Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) a
 | TableFormer | TB        | 89.6     | -              | 89.6  |
 | TableFormer | STN       | 96.9     | 95.7           | 96.7  |

-FT: Model was trained on PubTabNet then finetuned.
-
 Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate

 our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.
@@ -220,46 +218,10 @@ Table 4: Results of structure with content retrieved using cell detection on Pub

 a. Red - PDF cells, Green - predicted bounding boxes, Blue - post-processed predictions matched to PDF cells

-Japanese language (previously unseen by TableFormer):
-
-Example table from FinTabNet:
-
-
-<!-- image -->
-
-
-<!-- image -->
-
-b. Structure predicted by TableFormer, with superimposed matched PDF cell text:
-
-
-|                                                    |             | 論文ファイル   | 論文ファイル   | 参考文献   | 参考文献   |
-|----------------------------------------------------|-------------|----------------|----------------|------------|------------|
-| 出典                                               | ファイル 数 | 英語           | 日本語         | 英語       | 日本語     |
-| Association for Computational Linguistics(ACL2003) | 65          | 65             | 0              | 150        | 0          |
-| Computational Linguistics(COLING2002)              | 140         | 140            | 0              | 150        | 0          |
-| 電気情報通信学会 2003 年総合大会                   | 150         | 8              | 142            | 223        | 147        |
-| 情報処理学会第 65 回全国大会 (2003)                | 177         | 1              | 176            | 150        | 236        |
-| 第 17 回人工知能学会全国大会 (2003)                | 208         | 5              | 203            | 152        | 244        |
-| 自然言語処理研究会第 146 〜 155 回                 | 98          | 2              | 96             | 150        | 232        |
-| WWW から収集した論文                               | 107         | 73             | 34             | 147        | 96         |
-|                                                    | 945         | 294            | 651            | 1122       | 955        |
-
 Text is aligned to match original for ease of viewing
-|                          | Shares (in millions)   | Shares (in millions)   | Weighted Average Grant Date Fair Value   | Weighted Average Grant Date Fair Value   |
-|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
-|                          | RS U s                 | PSUs                   | RSUs                                     | PSUs                                     |
-| Nonvested on Janua ry 1  | 1. 1                   | 0.3                    | 90.10 $                                  | $ 91.19                                  |
-| Granted                  | 0. 5                   | 0.1                    | 117.44                                   | 122.41                                   |
-| Vested                   | (0. 5 )                | (0.1)                  | 87.08                                    | 81.14                                    |
-| Canceled or forfeited    | (0. 1 )                | -                      | 102.01                                   | 92.18                                    |
-| Nonvested on December 31 | 1.0                    | 0.3                    | 104.85 $                                 | $ 104.51                                 |
+<!-- image -->

 Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.
-<!-- image -->
-
-
-<!-- image -->

 Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.
 <!-- image -->
@@ -356,9 +318,7 @@ and evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael F

 [38] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR) , pages 1015-1022, 2019. 1

-## TableFormer: Table Structure Understanding with Transformers
-
-Supplementary Material
+## TableFormer: Table Structure Understanding with Transformers Supplementary Material

 ## 1. Details on the datasets

@@ -437,13 +397,16 @@ dian cell size for all table cells. The usage of median during the computations,

 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-

-## phan cell.
+phan cell.

 9f. Otherwise create a new structural cell and match it wit the orphan cell.

 Aditional images with examples of TableFormer predictions and post-processing can be found below.

 Figure 8: Example of a table with multi-line header.
+<!-- image -->
+
+Figure 9: Example of a table with big empty distance between cells.

 Figure 9: Example of a table with big empty distance between cells.
 <!-- image -->
@@ -451,28 +414,21 @@ Figure 9: Example of a table with big empty distance between cells.
 Figure 10: Example of a complex table with empty cells.
 <!-- image -->

-
+Figure 14: Example with multi-line text.
 <!-- image -->

 Figure 11: Simple table with different style and empty cells.
 <!-- image -->

 Figure 12: Simple table predictions and post processing.
-<!-- image -->

 Figure 13: Table predictions example on colorful table.
-
-Figure 14: Example with multi-line text.
-<!-- image -->
-
-Figure 16: Example of how post-processing helps to restore mis-aligned bounding boxes prediction artifact.
-<!-- image -->
-
-
 <!-- image -->

 Figure 15: Example with triangular table.
 <!-- image -->

+Figure 16: Example of how post-processing helps to restore mis-aligned bounding boxes prediction artifact.
+
 Figure 17: Example of long table. End-to-end example from initial PDF cells to prediction of bounding boxes, post processing and prediction of structure.
 <!-- image -->