Docling/tests/data/groundtruth/docling_v2/code_and_formula.json
Maxim Lysak 6e75f0b5d3 fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)
* Testing fix for docling-core dt

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* fix: Fix code_formula test unit, update test-cases

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix code-formula model for new docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Update fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test cases for office formats

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update deps and lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-17 14:11:55 +01:00


{"schema_name": "DoclingDocument", "version": "1.1.0", "name": "code_and_formula", "origin": {"mimetype": "application/pdf", "binary_hash": 8967166443255744998, "filename": "code_and_formula.pdf", "uri": null}, "furniture": {"self_ref": "#/furniture", "parent": null, "children": [], "content_layer": "furniture", "name": "_root_", "label": "unspecified"}, "body": {"self_ref": "#/body", "parent": null, "children": [{"cref": "#/texts/0"}, {"cref": "#/texts/1"}, {"cref": "#/texts/2"}, {"cref": "#/texts/3"}, {"cref": "#/texts/4"}, {"cref": "#/texts/5"}, {"cref": "#/texts/6"}, {"cref": "#/texts/7"}, {"cref": "#/texts/8"}, {"cref": "#/texts/9"}, {"cref": "#/texts/10"}, {"cref": "#/texts/11"}, {"cref": "#/texts/12"}, {"cref": "#/texts/13"}, {"cref": "#/texts/14"}, {"cref": "#/texts/15"}], "content_layer": "body", "name": "_root_", "label": "unspecified"}, "groups": [], "texts": [{"self_ref": "#/texts/0", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "section_header", "prov": [{"page_no": 1, "bbox": {"l": 133.76800537109375, "t": 667.1912231445312, "r": 315.91595458984375, "b": 654.4518432617188, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 23]}], "orig": "JavaScript Code Example", "text": "JavaScript Code Example", "level": 1}, {"self_ref": "#/texts/1", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 1, "bbox": {"l": 133.76800537109375, "t": 642.3280639648438, "r": 477.4827575683594, "b": 501.97412109375, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 887]}], "orig": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. 
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.", "text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."}, {"self_ref": "#/texts/2", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 1, "bbox": {"l": 133.76800537109375, "t": 498.86590576171875, "r": 477.478759765625, "b": 454.1541748046875, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 298]}], "orig": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,", "text": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,"}, {"self_ref": "#/texts/3", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "paragraph", "prov": [{"page_no": 1, "bbox": {"l": 223.15499877929688, "t": 442.0789489746094, "r": 388.09375, "b": 433.232177734375, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 36]}], "orig": "Listing 1: Simple JavaScript Program", "text": "Listing 1: Simple JavaScript Program"}, {"self_ref": "#/texts/4", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "code", "prov": [{"page_no": 1, "bbox": {"l": 134.23899841308594, "t": 425.6004943847656, "r": 263.2240905761719, "b": 385.25445556640625, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 60]}], "orig": "function add(a, b) { return a + b; } console.log(add(3, 5));", "text": "function add(a, b) { return a + b; } console.log(add(3, 5));", "captions": [], "references": [], "footnotes": [], "image": null, "code_language": "unknown"}, {"self_ref": "#/texts/5", "parent": {"cref": "#/body"}, 
"children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 1, "bbox": {"l": 133.76800537109375, "t": 372.93902587890625, "r": 477.4817199707031, "b": 232.58535766601562, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 887]}], "orig": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.", "text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."}, {"self_ref": "#/texts/6", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 1, "bbox": {"l": 133.76800537109375, "t": 229.47714233398438, "r": 477.478759765625, "b": 184.7643585205078, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 298]}], "orig": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,", "text": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet,"}, {"self_ref": "#/texts/7", "parent": {"cref": "#/body"}, "children": [], "content_layer": "furniture", "label": "page_footer", "prov": [{"page_no": 1, "bbox": {"l": 303.13299560546875, "t": 96.27914428710938, "r": 308.1142883300781, "b": 87.4323501586914, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 1]}], "orig": "1", "text": "1"}, {"self_ref": "#/texts/8", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "section_header", "prov": [{"page_no": 2, "bbox": {"l": 133.76800537109375, "t": 717.0812377929688, "r": 191.52723693847656, "b": 704.3418579101562, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 7]}], "orig": "Formula", "text": "Formula", "level": 1}, {"self_ref": "#/texts/9", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 2, "bbox": {"l": 133.76800537109375, "t": 692.2180786132812, "r": 477.4827575683594, "b": 551.8641357421875, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 887]}], 
"orig": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.", "text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."}, {"self_ref": "#/texts/10", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 2, "bbox": {"l": 133.76800537109375, "t": 548.7559204101562, "r": 477.48162841796875, "b": 492.0881042480469, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 369]}], "orig": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt.", "text": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt."}, {"self_ref": "#/texts/11", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "formula", "prov": [{"page_no": 2, "bbox": {"l": 280.55401611328125, "t": 479.0646667480469, "r": 330.6965637207031, "b": 468.1781005859375, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 12]}], "orig": "a 2 + 8 = 12", "text": ""}, {"self_ref": "#/texts/12", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 2, "bbox": {"l": 133.7679901123047, "t": 459.09185791015625, "r": 477.481689453125, "b": 318.73822021484375, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 887]}], "orig": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.", "text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."}, {"self_ref": "#/texts/13", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 2, "bbox": {"l": 133.7679901123047, "t": 315.6300048828125, "r": 477.48370361328125, "b": 247.00729370117188, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 415]}], "orig": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.", "text": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat."}, {"self_ref": "#/texts/14", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "text", "prov": [{"page_no": 2, "bbox": {"l": 133.7679901123047, "t": 243.89907836914062, "r": 477.48370361328125, "b": 175.2762908935547, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 415]}], "orig": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.", "text": "Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat."}, {"self_ref": "#/texts/15", "parent": {"cref": "#/body"}, "children": [], "content_layer": "furniture", "label": "page_footer", "prov": [{"page_no": 2, "bbox": {"l": 303.13299560546875, "t": 146.16807556152344, "r": 308.1142883300781, "b": 137.3212890625, "coord_origin": "BOTTOMLEFT"}, "charspan": [0, 1]}], "orig": "1", "text": "1"}], "pictures": [], "tables": [], "key_value_items": [], "pages": {"1": {"size": {"width": 612.0, "height": 792.0}, "image": null, "page_no": 1}, "2": {"size": {"width": 595.2760009765625, "height": 841.8900146484375}, "image": null, "page_no": 2}}}