
* fix(ocr): tesseract support mis-oriented documents Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): update missing test data Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): rotate image to the natural orientation before layout prediction Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): move bounding bow rotation util to orientation.py Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): refactor rotation utilities Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): revert layout updates Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): update e2e OCR test data Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * fix(ocr): avoid to swallow tesseract errors causing orientation detection failures Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): revert layout updates Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com> * chore(ocr): update e2e OCR test data * chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel` * chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel` * chore(ocr): default `TesseractOcrCliModel._is_auto` to `False` * fix(ocr): fix `TesseractOcrCliModel._is_auto` computation * chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel` --------- Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
286 lines
8.7 KiB
JSON
Vendored
286 lines
8.7 KiB
JSON
Vendored
{
|
|
"_name": "",
|
|
"type": "pdf-document",
|
|
"description": {
|
|
"title": null,
|
|
"abstract": null,
|
|
"authors": null,
|
|
"affiliations": null,
|
|
"subjects": null,
|
|
"keywords": null,
|
|
"publication_date": null,
|
|
"languages": null,
|
|
"license": null,
|
|
"publishers": null,
|
|
"url_refs": null,
|
|
"references": null,
|
|
"publication": null,
|
|
"reference_count": null,
|
|
"citation_count": null,
|
|
"citation_date": null,
|
|
"advanced": null,
|
|
"analytics": null,
|
|
"logs": [],
|
|
"collection": null,
|
|
"acquisition": null
|
|
},
|
|
"file-info": {
|
|
"filename": "picture_classification.pdf",
|
|
"filename-prov": null,
|
|
"document-hash": "959854dff729acaa22404d629a45cefcad8d942e595961185fc03a80d9fcc3a1",
|
|
"#-pages": 2,
|
|
"collection-name": null,
|
|
"description": null,
|
|
"page-hashes": [
|
|
{
|
|
"hash": "d9e3fc1226356b30c66012f05ad14089b00c59ea129195cd6ff8a0c68bda6f39",
|
|
"model": "default",
|
|
"page": 1
|
|
},
|
|
{
|
|
"hash": "9386884e13a97ce9662210a7e4258bbbb4f2e0e00663636160918e55b2806575",
|
|
"model": "default",
|
|
"page": 2
|
|
}
|
|
]
|
|
},
|
|
"main-text": [
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
133.76801,
|
|
654.45184,
|
|
252.35513,
|
|
667.19122
|
|
],
|
|
"page": 1,
|
|
"span": [
|
|
0,
|
|
15
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Figures Example",
|
|
"type": "subtitle-level-1",
|
|
"payload": null,
|
|
"name": "Section-header",
|
|
"font": null
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
133.76801,
|
|
501.97412,
|
|
477.48276,
|
|
642.32806
|
|
],
|
|
"page": 1,
|
|
"span": [
|
|
0,
|
|
887
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.",
|
|
"type": "paragraph",
|
|
"payload": null,
|
|
"name": "Text",
|
|
"font": null
|
|
},
|
|
{
|
|
"name": "Picture",
|
|
"type": "figure",
|
|
"$ref": "#/figures/0"
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
226.89101,
|
|
254.01826000000005,
|
|
384.3548,
|
|
262.86505
|
|
],
|
|
"page": 1,
|
|
"span": [
|
|
0,
|
|
35
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Figure 1: This is an example image.",
|
|
"type": "caption",
|
|
"payload": null,
|
|
"name": "Caption",
|
|
"font": null
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
133.76801,
|
|
122.51225,
|
|
477.48172000000005,
|
|
238.95505000000003
|
|
],
|
|
"page": 1,
|
|
"span": [
|
|
0,
|
|
747
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
|
|
"type": "paragraph",
|
|
"payload": null,
|
|
"name": "Text",
|
|
"font": null
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
133.76801,
|
|
523.7951,
|
|
477.48172000000005,
|
|
664.1490499999999
|
|
],
|
|
"page": 2,
|
|
"span": [
|
|
0,
|
|
887
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.",
|
|
"type": "paragraph",
|
|
"payload": null,
|
|
"name": "Text",
|
|
"font": null
|
|
},
|
|
{
|
|
"name": "Picture",
|
|
"type": "figure",
|
|
"$ref": "#/figures/1"
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
226.89101,
|
|
259.94226000000003,
|
|
384.3548,
|
|
268.78903
|
|
],
|
|
"page": 2,
|
|
"span": [
|
|
0,
|
|
35
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Figure 2: This is an example image.",
|
|
"type": "caption",
|
|
"payload": null,
|
|
"name": "Caption",
|
|
"font": null
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
133.76801,
|
|
117.32024000000001,
|
|
477.48172000000005,
|
|
245.71804999999995
|
|
],
|
|
"page": 2,
|
|
"span": [
|
|
0,
|
|
804
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.",
|
|
"type": "paragraph",
|
|
"payload": null,
|
|
"name": "Text",
|
|
"font": null
|
|
}
|
|
],
|
|
"figures": [
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
134.9200439453125,
|
|
281.78173828125,
|
|
475.6635437011719,
|
|
487.109375
|
|
],
|
|
"page": 1,
|
|
"span": [
|
|
0,
|
|
35
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Figure 1: This is an example image.",
|
|
"type": "figure",
|
|
"payload": null,
|
|
"bounding-box": null
|
|
},
|
|
{
|
|
"prov": [
|
|
{
|
|
"bbox": [
|
|
218.8155517578125,
|
|
283.10589599609375,
|
|
391.96246337890625,
|
|
513.9846496582031
|
|
],
|
|
"page": 2,
|
|
"span": [
|
|
0,
|
|
35
|
|
],
|
|
"__ref_s3_data": null
|
|
}
|
|
],
|
|
"text": "Figure 2: This is an example image.",
|
|
"type": "figure",
|
|
"payload": null,
|
|
"bounding-box": null
|
|
}
|
|
],
|
|
"tables": [],
|
|
"bitmaps": null,
|
|
"equations": [],
|
|
"footnotes": [],
|
|
"page-dimensions": [
|
|
{
|
|
"height": 792.0,
|
|
"page": 1,
|
|
"width": 612.0
|
|
},
|
|
{
|
|
"height": 792.0,
|
|
"page": 2,
|
|
"width": 612.0
|
|
}
|
|
],
|
|
"page-footers": [],
|
|
"page-headers": [],
|
|
"_s3_data": null,
|
|
"identifiers": null
|
|
} |