feat(ocr): auto-detect rotated pages in Tesseract (#1167)

* fix(ocr): support mis-oriented documents in Tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing Tesseract errors that cause orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
Clément Doumouro
2025-05-21 18:12:33 +02:00
committed by GitHub
parent 90875247e5
commit 45265bf8b1
96 changed files with 9864 additions and 5258 deletions
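
The commit message says OCR proceeds without rotation when OSD fails. A minimal sketch of that fallback, assuming the orientation is read from the `Rotate:` field of Tesseract's textual OSD report; the helper name `parse_osd_rotation` is hypothetical and is not taken from this commit:

```python
import re


def parse_osd_rotation(osd_output: str) -> int:
    """Return the angle from the `Rotate:` field of an OSD report.

    Falls back to 0 (no rotation) when the field is missing or holds
    an unexpected value, so OCR can still proceed on the page as-is.
    """
    match = re.search(r"^Rotate:\s*(\d+)\s*$", osd_output, flags=re.MULTILINE)
    if match is None:
        # OSD failed or produced no usable report: do not rotate.
        return 0
    angle = int(match.group(1)) % 360
    # Only right-angle rotations are meaningful here.
    return angle if angle in (0, 90, 180, 270) else 0
```

Returning 0 instead of raising keeps a failed detection from aborting the whole page; a caller would typically log the failure before continuing.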

@@ -209,7 +209,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.1",
-"text": "Paragraph 1.1"
+"text": "Paragraph 1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/5",
@@ -233,7 +239,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.2",
-"text": "Paragraph 1.2"
+"text": "Paragraph 1.2",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/7",
@@ -298,7 +310,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.1.1",
-"text": "Paragraph 1.1.1"
+"text": "Paragraph 1.1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/11",
@@ -322,7 +340,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.1.2",
-"text": "Paragraph 1.1.2"
+"text": "Paragraph 1.1.2",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/13",
@@ -390,7 +414,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.1.1",
-"text": "Paragraph 1.1.1"
+"text": "Paragraph 1.1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/17",
@@ -414,7 +444,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.1.2",
-"text": "Paragraph 1.1.2"
+"text": "Paragraph 1.1.2",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/19",
@@ -482,7 +518,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.2.3.1",
-"text": "Paragraph 1.2.3.1"
+"text": "Paragraph 1.2.3.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/23",
@@ -506,7 +548,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 1.2.3.1",
-"text": "Paragraph 1.2.3.1"
+"text": "Paragraph 1.2.3.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/25",
@@ -567,7 +615,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.1",
-"text": "Paragraph 2.1"
+"text": "Paragraph 2.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/30",
@@ -591,7 +645,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.2",
-"text": "Paragraph 2.2"
+"text": "Paragraph 2.2",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/32",
@@ -656,7 +716,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.1.1.1",
-"text": "Paragraph 2.1.1.1"
+"text": "Paragraph 2.1.1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/36",
@@ -680,7 +746,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.1.1.1",
-"text": "Paragraph 2.1.1.1"
+"text": "Paragraph 2.1.1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/38",
@@ -748,7 +820,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.1.1",
-"text": "Paragraph 2.1.1"
+"text": "Paragraph 2.1.1",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/42",
@@ -772,7 +850,13 @@
 "label": "paragraph",
 "prov": [],
 "orig": "Paragraph 2.1.2",
-"text": "Paragraph 2.1.2"
+"text": "Paragraph 2.1.2",
+"formatting": {
+"bold": false,
+"italic": false,
+"underline": false,
+"strikethrough": false
+}
 },
 {
 "self_ref": "#/texts/44",