Docling/tests/data/groundtruth/docling_v1/right_to_left_01.json
Michele Dolfi 9114ada7bc
fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)
fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporarily constrain beautifulsoup


* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* cleaned up the data folder in the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* switch to code formula model v1.0.1 and new test pdf

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added three test-files for right-to-left

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix black

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* added new gt for test_e2e_conversion

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* Add code to expose text direction of cell

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* new test file

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>

* update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix mypy reports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix example filepaths

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add test data results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin wheel of latest docling-parse release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use latest docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove debugging code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix path to files in example

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Revert unwanted RTL additions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix test data paths in examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-02-07 08:43:31 +01:00

1 line
9.4 KiB
JSON

{"_name": "", "type": "pdf-document", "description": {"title": null, "abstract": null, "authors": null, "affiliations": null, "subjects": null, "keywords": null, "publication_date": null, "languages": null, "license": null, "publishers": null, "url_refs": null, "references": null, "publication": null, "reference_count": null, "citation_count": null, "citation_date": null, "advanced": null, "analytics": null, "logs": [], "collection": null, "acquisition": null}, "file-info": {"filename": "right_to_left_01.pdf", "filename-prov": null, "document-hash": "85c9c0772fa51fd26f16eaae6abd522c96a4d169ceb7b72cbcfe3444ce22db79", "#-pages": 1, "collection-name": null, "description": null, "page-hashes": [{"hash": "6400df9d1750f707e1e0b310224d0b988ed99457bd230029715def0a6030dd06", "model": "default", "page": 1}]}, "main-text": [{"prov": [{"bbox": [223.85000610351562, 704.4510498046875, 521.9818115234375, 719.4619750976562], "page": 1, "span": [0, 59], "__ref_s3_data": null}], "text": "Python\u0648 R \u0629\u063a\u0644\u0628 \u0629\u062c\u0645\u0631\u0628\u0644\u0627 \u0644\u0644\u0627\u062e \u0646\u0645 \u062a\u0644\u0627\u0643\u0634\u0645\u0644\u0627 \u0644\u062d\u0648 \u0629\u064a\u062c\u0627\u062a\u0646\u0644\u0625\u0627 \u0646\u064a\u0633\u062d\u062a", "type": "subtitle-level-1", "payload": null, "name": "Section-header", "font": null}, {"prov": [{"bbox": [90.74400329589844, 635.3080444335938, 522.1900024414062, 689.9920043945312], "page": 1, "span": [0, 345], "__ref_s3_data": null}], "text": "Python \u0648 R \u0629\u063a\u0644\u0628 \u0629\u062c\u0645\u0631\u0628\u0644\u0627 \u0631\u0628\u062a\u0639\u062a \u0629\u0644\u0627\u0639\u0641 \u0644\u0648\u0644\u062d \u062f\u0627\u062c\u064a\u0625 \u064a\u0641 \u062f\u0639\u0627\u0633\u062a\u0648 \u0629\u064a\u062c\u0627\u062a\u0646\u0644\u0625\u0627 \u0632\u0632\u0639\u062a \u0646\u0623 \u0646\u0643\u0645\u064a \u064a\u062a\u0644\u0627 \u0629\u064a\u0648\u0642\u0644\u0627 \u062a\u0627\u0648\u062f\u0644\u0623\u0627 \u0646\u0645 
\u0621\u0627\u0645\u0644\u0639\u0644\u0627\u0648 \u0646\u064a\u0644\u0644\u062d\u0645\u0644\u0627 \u0649\u0644\u0639 \u0644\u0647\u0633\u064a \u0627\u0645\u0645 \u060c\u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0644\u064a\u0644\u062d\u062a\u0644 \u0629\u064a\u0644\u0627\u062b\u0645 \u0627\u0647\u0644\u0639\u062c\u062a \u0629\u062f\u064a\u0631\u0641 \u062a\u0627\u0632\u064a\u0645Python \u0648 R \u0646\u0645 \u0644\u0643 \u0643\u0644\u062a\u0645\u064a .\u062a\u0644\u0627\u0643\u0634\u0645\u0644\u0644 \u0646\u0627\u0643 \u0627\u0630\u0625 .\u0629\u0644\u0627\u0639\u0641\u0648 \u0629\u0639\u064a\u0631\u0633 \u0629\u0642\u064a\u0631\u0637\u0628 \u0629\u062f\u0642\u0639\u0645 \u062a\u0644\u0627\u064a\u0644\u062d\u062a \u0621\u0627\u0631\u062c\u0625 \u0645\u0647\u0633\u064a \u0646\u0623 \u0646\u0643\u0645\u064a \u062a\u0627\u063a\u0644\u0644\u0627 \u0647\u0630\u0647 \u0645\u0627\u062f\u062e\u062a\u0633\u0627 \u0646\u0625\u0641 \u060c\u0629\u064a\u0644\u064a\u0644\u062d\u062a \u0629\u064a\u0644\u0642\u0639 \u0643\u064a\u062f\u0644 .\u0644\u0645\u0639\u0644\u0627 \u062c\u0626\u0627\u062a\u0646 \u0646\u064a\u0633\u062d\u062a \u064a\u0641 \u0631\u064a\u0628\u0643 \u0644\u0643\u0634\u0628", "type": "paragraph", "payload": null, "name": "Text", "font": null}, {"prov": [{"bbox": [208.10401916503906, 579.3880615234375, 208.10401916503906, 592.6720581054688], "page": 1, "span": [0, 1], "__ref_s3_data": null}], "text": "\u064b", "type": "paragraph", "payload": null, "name": "Text", "font": null}, {"prov": [{"bbox": [99.86399841308594, 566.0679931640625, 522.2379150390625, 620.7520141601562], "page": 1, "span": [0, 348], "__ref_s3_data": null}], "text": "\u062c\u0627\u0631\u062e\u062a\u0633\u0627\u0648 \u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0646\u0645 \u0629\u0644\u0626\u0627\u0647 \u062a\u0627\u064a\u0645\u0643 \u0629\u062c\u0644\u0627\u0639\u0645 \u0646\u0643\u0645\u0645\u0644\u0627 \u0646\u0645 \u062d\u0628\u0635\u064a 
\u060c\u0629\u062c\u0645\u0631\u0628\u0644\u0627 \u062a\u0627\u0631\u0627\u0647\u0645 \u0639\u0645 \u064a\u0644\u064a\u0644\u062d\u062a\u0644\u0627 \u0631\u064a\u0643\u0641\u062a\u0644\u0627 \u0639\u0645\u062a\u062c\u064a \u0627\u0645\u062f\u0646\u0639 \u0630\u064a\u0641\u0646\u062a\u0644Python \u0648 R \u0645\u0627\u062f\u062e\u062a\u0633\u0627 \u0646\u064a\u062c\u0645\u0631\u0628\u0645\u0644\u0644 \u0646\u0643\u0645\u064a .\u0627\u0647\u0646\u0645 \u062a\u0627\u0647\u062c\u0648\u062a\u0644\u0627\u0648 \u0637\u0627\u0645\u0646\u0644\u0623\u0627 \u0629\u062c\u0630\u0645\u0646\u0644\u0627 \u0644\u062b\u0645 \u060c\u0629\u0645\u062f\u0642\u062a\u0645 \u0629\u064a\u0644\u064a\u0644\u062d\u062a \u062a\u0627\u064a\u0644\u0645\u0639 \u0629\u0642\u062f \u0631\u062b\u0643\u0623 \u062a\u0627\u0631\u0627\u0631\u0642 \u0630\u0627\u062e\u062a\u0627 \u0649\u0644\u0625 \u0627 \u0636\u064a\u0623 \u064a\u062f\u0624\u064a \u0646\u0623 \u0646\u0643\u0645\u064a \u0644\u0628 \u060c\u062a\u0642\u0648\u0644\u0627 \u0631\u0641\u0648\u064a \u0637\u0642\u0641 \u0633\u064a\u0644 \u0627\u0630\u0647 .\u0629\u0631\u064a\u0628\u0643\u0644\u0627 \u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0644\u064a\u0644\u062d\u062a\u0648 \u0629\u064a\u0626\u0627\u0635\u062d\u0644\u0625\u0627 \u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0649\u0644\u0639 \u0629\u0645\u0626\u0627\u0642 \u062a\u0627\u062c\u0627\u062a\u0646\u062a\u0633\u0627 \u0649\u0644\u0639 \u0621\u0627\u0646\u0628 .", "type": "paragraph", "payload": null, "name": "Text", "font": null}, {"prov": [{"bbox": [509.34991455078125, 564.7479858398438, 509.34991455078125, 578.031982421875], "page": 1, "span": [0, 1], "__ref_s3_data": null}], "text": "\u064b", "type": "paragraph", "payload": null, "name": "Text", "font": null}, {"prov": [{"bbox": [92.90399932861328, 496.9179992675781, 522.10595703125, 551.6320190429688], "page": 1, "span": [0, 375], "__ref_s3_data": null}], "text": "\u0644\u064a\u0644\u062d\u062a\u0644\u0627 \u0646\u0645 
\u060c\u062a\u0627\u0642\u064a\u0628\u0637\u062a\u0644\u0627 \u0646\u0645 \u0629\u0639\u0633\u0627\u0648 \u0629\u0639\u0648\u0645\u062c\u0645 \u0645\u0639\u062f\u062a \u0629\u064a\u0646\u063a \u062a\u0627\u0648\u062f\u0623\u0648 \u062a\u0627\u0628\u062a\u0643\u0645Python \u0648 R \u0646\u0645 \u0644\u0643 \u0631\u0641\u0648\u062a \u060c\u0643\u0644\u0630 \u0649\u0644\u0639 \u0629\u0648\u0644\u0627\u0639 \u0649\u0644\u0639 .\u0629\u0641\u0644\u062a\u062e\u0645\u0644\u0627 \u062a\u0644\u0627\u0643\u0634\u0645\u0644\u0644 \u0629\u0631\u0643\u062a\u0628\u0645 \u0644\u0648\u0644\u062d \u0631\u064a\u0648\u0637\u062a\u0644 \u062a\u0627\u0628\u062a\u0643\u0645\u0644\u0627 \u0647\u0630\u0647 \u0646\u0645 \u0629\u062f\u0627\u0641\u062a\u0633\u0644\u0627\u0627 \u0646\u064a\u0645\u062f\u062e\u062a\u0633\u0645\u0644\u0644 \u0646\u0643\u0645\u064a .\u064a\u0644\u0644\u0622\u0627 \u0645\u0644\u0639\u062a\u0644\u0627 \u0649\u0644\u0625 \u064a\u0646\u0627\u064a\u0628\u0644\u0627 R \u0631\u0641\u0648\u062a \u0627\u0645\u0646\u064a\u0628 \u060c\u0629\u0621\u0627\u0641\u0643\u0628 \u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0629\u0631\u0627\u062f\u0644\u0625 Python \u064a\u0641 pandas \u0629\u0628\u062a\u0643\u0645 \u0645\u0627\u062f\u062e\u062a\u0633\u0627 \u0646\u0643\u0645\u064a \u060c\u0644\u0627\u062b\u0645\u0644\u0627 \u0644\u064a\u0628\u0633 \u0645\u0633\u0631\u0644\u0644 \u0629\u064a\u0648\u0642 \u062a\u0627\u0648\u062f\u0623 .\u0646\u064a\u0644\u0644\u062d\u0645\u0644\u0627\u0648 \u0646\u064a\u062b\u062d\u0627\u0628\u0644\u0644 \u0629\u064a\u0644\u0627\u062b\u0645 \u0627\u0647\u0644\u0639\u062c\u064a \u0627\u0645\u0645 \u060c\u064a\u0626\u0627\u0635\u062d\u0644\u0625\u0627 \u0644\u064a\u0644\u062d\u062a\u0644\u0627\u0648 \u064a\u0646\u0627\u064a\u0628\u0644\u0627", "type": "paragraph", "payload": null, "name": "Text", "font": null}, {"prov": [{"bbox": [96.86399841308594, 441.4779968261719, 522.0740356445312, 482.36199951171875], "page": 1, "span": [0, 267], 
"__ref_s3_data": null}], "text": "Python \u0648 R \u0629\u063a\u0644\u0628 \u0629\u062c\u0645\u0631\u0628\u0644\u0627 \u064a\u062f\u0624\u062a \u0646\u0623 \u0646\u0643\u0645\u064a \u060c\u0629\u064a\u0627\u0647\u0646\u0644\u0627 \u064a\u0641 \u0629\u0631\u0643\u062a\u0628\u0645 \u0644\u0648\u0644\u062d \u0631\u064a\u0641\u0648\u062a\u0648 \u0629\u064a\u062c\u0627\u062a\u0646\u0644\u0625\u0627 \u0646\u064a\u0633\u062d\u062a \u0649\u0644\u0625 \u0629\u064a\u0644\u064a\u0644\u062d\u062a \u0629\u064a\u0644\u0642\u0639 \u0639\u0645 \u0627\u0647\u0644 \u0646\u0648\u0643\u062a \u0646\u0623 \u0646\u0643\u0645\u064a \u0629\u0628\u0633\u0627\u0646\u0645\u0644\u0627 \u0629\u064a\u062c\u0645\u0631\u0628\u0644\u0627 \u0628\u064a\u0644\u0627\u0633\u0644\u0623\u0627 \u0642\u064a\u0628\u0637\u062a\u0648 \u0644\u0627\u0639\u0641 \u0644\u0643\u0634\u0628 \u062a\u0627\u0646\u0627\u064a\u0628\u0644\u0627 \u0644\u064a\u0644\u062d\u062a \u0649\u0644\u0639 \u0629\u0631\u062f\u0642\u0644\u0627 \u0646\u0625 .\u0629\u062f\u0642\u0639\u0645\u0644\u0627 \u062a\u0644\u0627\u0643\u0634\u0645\u0644\u0644 .\u064a\u0646\u0647\u0645\u0644\u0627\u0648 \u064a\u0635\u062e\u0634\u0644\u0627 \u0621\u0627\u062f\u0644\u0623\u0627 \u0649\u0644\u0639 \u0649\u062f\u0645\u0644\u0627 \u0629\u062f\u064a\u0639\u0628 \u0629\u064a\u0628\u0627\u062c\u064a\u0625 \u062a\u0627\u0631\u064a\u062b\u0623\u062a", "type": "paragraph", "payload": null, "name": "Text", "font": null}], "figures": [], "tables": [], "bitmaps": null, "equations": [], "footnotes": [], "page-dimensions": [{"height": 792.0, "page": 1, "width": 612.0}], "page-footers": [], "page-headers": [], "_s3_data": null, "identifiers": null}