{ "_name": "", "type": "pdf-document", "description": { "title": null, "abstract": null, "authors": null, "affiliations": null, "subjects": null, "keywords": null, "publication_date": null, "languages": null, "license": null, "publishers": null, "url_refs": null, "references": null, "publication": null, "reference_count": null, "citation_count": null, "citation_date": null, "advanced": null, "analytics": null, "logs": [], "collection": null, "acquisition": null }, "file-info": { "filename": "2305.03393v1.pdf", "filename-prov": null, "document-hash": "c98927fda1ef9b66a4c3a236a65dc0cdf5c129be4122cdb58eaa3a37e3241eae", "#-pages": 14, "collection-name": null, "description": null, "page-hashes": [ { "hash": "f09df98501fbcd8a2b359e4686187b56b7d82f3eb312cbbb23f61661691ecbf9", "model": "default", "page": 1 }, { "hash": "6d26558563949e376cdb8dcb12a7288ec12d4c513de04616238aadcd15255d28", "model": "default", "page": 2 }, { "hash": "4ef8043e938e362a06bc7f88f0b02df95d95cbfc891f544b7f88a448e53fb689", "model": "default", "page": 3 }, { "hash": "8b755c3cd938ebf88bf14db6103c999794b0ca0c6f591f47a0c902b111159fe6", "model": "default", "page": 4 }, { "hash": "95582f3138775a800969e873ad2e4eafca4f1d1de7b9b14ad826bbe8a17fe302", "model": "default", "page": 5 }, { "hash": "619ab9fe3258434818f86df106cb76ed1fc8ab9800cbd91444098e91f7e67d8b", "model": "default", "page": 6 }, { "hash": "c02e90eed528fcb71d0657183903b3e2035b86e3e750fb579f8c1f1e09aa132d", "model": "default", "page": 7 }, { "hash": "b56262de55611de4494b0ed5011ce9567fada7c99bf53c5ff6c689ad9f941730", "model": "default", "page": 8 }, { "hash": "680962e4a1193f15a591c82e1be59c0ff4cc78a066aeaaccad41f9262c67197b", "model": "default", "page": 9 }, { "hash": "37dca86674661a5845a3bbd2fabb4a497cf2b5fc4908fd28dd63296c4fbee075", "model": "default", "page": 10 }, { "hash": "0e3c057d1d7e6b359d73d4a44597879b2d421097da9aeb18ea581b32666ce740", "model": "default", "page": 11 }, { "hash": "ec343c5522af29f238bde237ca655cdc018c5db20fb099c15ce8bc5045ce8593", "model": "default", "page": 12 }, { "hash": "4ffa1d69b1366de506ca77c25a021790c3c150791fc830d6f4c85c3846efe6a9", "model": "default", "page": 13 }, { "hash": "9fd62e0449eaf680e49767b4c512d8172cd3586480344318dc7e1cb0964b4d18", "model": "default", "page": 14 } ] }, "main-text": [ { "prov": [ { "bbox": [ 134.765, 645.4859, 480.59735, 676.10089 ], "page": 1, "span": [ 0, 60 ], "__ref_s3_data": null } ], "text": "Optimized Table Tokenization for Table Structure Recognition", "type": "subtitle-level-1", "payload": null, "name": "Section-header", "font": null }, { "prov": [ { "bbox": [ 139.34305, 591.81409, 476.01270000000005, 622.30841 ], "page": 1, "span": [ 0, 222 ], "__ref_s3_data": null } ], "text": "Maksym Lysak [0000 \u2212 0002 \u2212 3723 \u2212 $^{6960]}$, Ahmed Nassar[0000 \u2212 0002 \u2212 9468 \u2212 $^{0822]}$, Nikolaos Livathinos [0000 \u2212 0001 \u2212 8513 \u2212 $^{3491]}$, Christoph Auer[0000 \u2212 0001 \u2212 5761 \u2212 $^{0422]}$, [0000 \u2212 0002 \u2212 8088 \u2212 0823]", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 229.52109000000002, 587.61926, 298.6087, 596.41626 ], "page": 1, "span": [ 0, 15 ], "__ref_s3_data": null } ], "text": "and Peter Staar", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 279.1051, 566.72632, 336.25153, 574.79602 ], "page": 1, "span": [ 0, 12 ], "__ref_s3_data": null } ], "text": "IBM Research", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 
, { "prov": [ { "bbox": [ 222.96609, 555.72247, 392.38983, 563.19147 ], "page": 1, "span": [ 0, 36 ], "__ref_s3_data": null } ], "text": "{mly,ahn,nli,cau,taa}@zurich.ibm.com", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 163.1111, 327.26553, 452.24878000000007, 521.69885 ], "page": 1, "span": [ 0, 1198 ], "__ref_s3_data": null } ], "text": "Abstract. Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. Popular table structure data-sets will be published in OTSL format to the community.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 163.1111, 294.21451, 452.24158, 313.30606 ], "page": 1, "span": [ 0, 90 ], "__ref_s3_data": null } ], "text": "Keywords: Table Structure Recognition \u00b7 Data Representation \u00b7 Transformers \u00b7 Optimization.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76512, 259.31192, 228.93384, 269.88031 ], "page": 1, "span": [ 0, 14 ], "__ref_s3_data": null } ], "text": "1 Introduction", "type": "subtitle-level-1", "payload": null, "name": "Section-header", "font": null }, { "prov": [ { "bbox": [ 134.76512, 163.18548999999996, 480.5959500000001, 243.71345999999994 ], "page": 1, "span": [ 0, 500 ], "__ref_s3_data": null } ], "text": "Tables are ubiquitous in documents such as scientific papers, patents, reports, manuals, specification sheets or marketing material. They often encode highly valuable information and therefore need to be extracted with high accuracy. Unfortunately, tables appear in documents in various sizes, styling and structure, making it difficult to recover their correct structure with simple analytical methods. Therefore, accurate table extraction is achieved these days with machine-learning based methods.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76512, 127.14547000000005, 480.59583, 159.85244999999998 ], "page": 1, "span": [ 0, 227 ], "__ref_s3_data": null } ], "text": "In modern document understanding systems [1,15], table extraction is typically a two-step process. Firstly, every table on a page is located with a bounding box, and secondly, its logical row and column structure is recognized.", "type": "paragraph", "payload": null, "name": "Text", "font": null }
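, { "prov": null, "text": "A minimal sketch of this two-step interface (ours, not from the paper; all class, function and method names here are hypothetical placeholders for the concrete systems cited in this work):", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": null, "text": "# Hypothetical sketch of the two-step table-extraction pipeline; the\n# detector and TSR model stand in for any of the concrete systems cited here.\nfrom dataclasses import dataclass\n\n@dataclass\nclass Table:\n    bbox: tuple   # table location on the page (x0, y0, x1, y1)\n    tokens: list  # predicted structure tokens (e.g. HTML or OTSL markup)\n\ndef extract_tables(page_image, detector, tsr_model):\n    tables = []\n    # Step 1: locate every table on the page with a bounding box.\n    for bbox in detector.detect(page_image):\n        crop = page_image.crop(bbox)\n        # Step 2: recognize the logical row/column structure (Im2Seq).\n        tables.append(Table(bbox, tsr_model.predict_sequence(crop)))\n    return tables", "type": "paragraph", "payload": null, "name": "Text", "font": null }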
As of", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "name": "Picture", "type": "figure", "$ref": "#/figures/0" }, { "prov": [ { "bbox": [ 134.765, 591.77942, 480.59189, 665.66583 ], "page": 2, "span": [ 0, 574 ], "__ref_s3_data": null } ], "text": "Fig. 1. Comparison between HTML and OTSL table structure representation: (A) table-example with complex row and column headers, including a 2D empty span, (B) minimal graphical representation of table structure using rectangular layout, (C) HTML representation, (D) OTSL representation. This example demonstrates many of the key-features of OTSL, namely its reduced vocabulary size (12 versus 5 in this case), its reduced sequence length (55 versus 30) and a enhanced internal structure (variable token sequence length per row in HTML versus a fixed length of rows in OTSL).", "type": "caption", "payload": null, "name": "Caption", "font": null }, { "prov": [ { "bbox": [ 134.765, 271.11330999999996, 480.59232000000003, 339.68622 ], "page": 2, "span": [ 0, 435 ], "__ref_s3_data": null } ], "text": "today, table detection in documents is a well understood problem, and the latest state-of-the-art (SOTA) object detection methods provide an accuracy comparable to human observers [7,8,10,14,23]. On the other hand, the problem of table structure recognition (TSR) is a lot more challenging and remains a very active area of research, in which many novel machine learning algorithms are being explored [3,4,5,9,11,12,13,14,17,18,21,22].", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76501, 127.14530000000002, 480.59482, 267.44928000000004 ], "page": 2, "span": [ 0, 911 ], "__ref_s3_data": null } ], "text": "Recently emerging SOTA methods for table structure recognition employ transformer-based models, in which an image of the table is provided to the network in order to predict the structure of the table as a sequence of tokens. These image-to-sequence (Im2Seq) models are extremely powerful, since they allow for a purely data-driven solution. The tokens of the sequence typically belong to a markup language such as HTML, Latex or Markdown, which allow to describe table structure as rows, columns and spanning cells in various configurations. In Figure 1, we illustrate how HTML is used to represent the table-structure of a particular example table. Public table-structure data sets such as PubTabNet [22], and FinTabNet [21], which were created in a semi-automated way from paired PDF and HTML sources (e.g. PubMed Central), popularized primarily the use of HTML as ground-truth representation format for TSR.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76498, 580.58313, 480.59183, 673.06622 ], "page": 3, "span": [ 0, 584 ], "__ref_s3_data": null } ], "text": "While the majority of research in TSR is currently focused on the development and application of novel neural model architectures, the table structure representation language (e.g. HTML in PubTabNet and FinTabNet) is usually adopted as is for the sequence tokenization in Im2Seq models. In this paper, we aim for the opposite and investigate the impact of the table structure representation language with an otherwise unmodified Im2Seq transformer-based architecture. 
, { "prov": [ { "bbox": [ 134.76498, 352.91324, 480.59567, 457.35211 ], "page": 3, "span": [ 0, 628 ], "__ref_s3_data": null } ], "text": "The paper is structured as follows. In section 2, we give an overview of the latest developments in table-structure reconstruction. In section 3, we review the current HTML table encoding (popularised by PubTabNet and FinTabNet) and discuss its flaws. Subsequently, we introduce OTSL in section 4, which includes the language definition, syntax rules and error-correction procedures. In section 5, we apply OTSL to the TableFormer architecture, compare it to TableFormer models trained on HTML and ultimately demonstrate the advantages of using OTSL. Finally, in section 6, we conclude our work and outline next potential steps.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76498, 319.34366, 236.76912999999996, 329.91205 ], "page": 3, "span": [ 0, 14 ], "__ref_s3_data": null } ], "text": "2 Related Work", "type": "subtitle-level-1", "payload": null, "name": "Section-header", "font": null }, { "prov": [ { "bbox": [ 134.76498, 127.14423, 484.12047999999993, 303.31418 ], "page": 3, "span": [ 0, 1163 ], "__ref_s3_data": null } ], "text": "Approaches to formalize the logical structure and layout of tables in electronic documents date back more than two decades [16]. In the recent past, a wide variety of computer vision methods have been explored to tackle the problem of table structure recognition, i.e. the correct identification of columns, rows and spanning cells in a given table. Broadly speaking, the current deep-learning based approaches fall into three categories: object detection (OD) methods, Graph-Neural-Network (GNN) methods and Image-to-Markup-Sequence (Im2Seq) methods. Object-detection based methods [11,12,13,14,21] rely on table-structure annotation using (overlapping) bounding boxes for training, and produce bounding-box predictions to define table cells, rows, and columns on a table image. Graph Neural Network (GNN) based methods [3,6,17,18], as the name suggests, represent tables as graph structures. The graph nodes represent the content of each table cell, an embedding vector from the table image, or geometric coordinates of the table cell. The edges of the graph define the relationship between the nodes, e.g. if they belong to the same column, row, or table cell.", "type": "paragraph", "payload": null, "name": "Text", "font": null }
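, { "prov": null, "text": "As a simplified illustration of this graph view (our sketch; the exact node features and edge types vary across the cited systems), a small 2x2 table can be encoded as cell nodes connected by typed edges:", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": null, "text": "# Simplified sketch: a 2x2 table as a graph for GNN-based TSR.\n# Nodes are cells (grid coordinates plus content; real systems may instead\n# attach image embeddings or cell geometry as node features); edges carry\n# relation types such as same-row / same-column.\ncells = {0: (0, 0, 'Name'), 1: (0, 1, 'Age'),\n         2: (1, 0, 'Alice'), 3: (1, 1, '42')}\nedges = [(0, 1, 'same-row'), (2, 3, 'same-row'),\n         (0, 2, 'same-column'), (1, 3, 'same-column')]", "type": "paragraph", "payload": null, "name": "Text", "font": null }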
, { "prov": [ { "bbox": [ 134.765, 532.76208, 480.5957599999999, 673.06622 ], "page": 4, "span": [ 0, 937 ], "__ref_s3_data": null } ], "text": "Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal table-structure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, as is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.", "type": "paragraph", "payload": null, "name": "Text", "font": null }, { "prov": [ { "bbox": [ 134.76498, 305.3533, 480.59569999999997, 529.34308 ], "page": 4, "span": [ 0, 1405 ], "__ref_s3_data": null } ], "text": "Within the Im2Seq methods, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], Tabsplitter [2] and Ye et al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell (