feat(ocr): auto-detect rotated pages in Tesseract (#1167)

* fix(ocr): support mis-oriented documents in Tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing Tesseract errors that cause orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
Clément Doumouro
2025-05-21 18:12:33 +02:00
committed by GitHub
parent 90875247e5
commit 45265bf8b1
96 changed files with 9864 additions and 5258 deletions
@@ -16866,10 +16866,10 @@
 {
 "page_no": 1,
 "bbox": {
-"l": 323.4081115722656,
+"l": 323.408203125,
 "t": 541.6512603759766,
-"r": 553.295166015625,
-"b": 266.14935302734375,
+"r": 553.2952270507812,
+"b": 266.1492919921875,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -16941,9 +16941,9 @@
 "page_no": 3,
 "bbox": {
 "l": 88.33030700683594,
-"t": 699.1134490966797,
+"t": 699.1134796142578,
 "r": 263.7049560546875,
-"b": 571.4317626953125,
+"b": 571.4317321777344,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -16979,9 +16979,9 @@
 "page_no": 4,
 "bbox": {
 "l": 53.05912780761719,
-"t": 481.20867919921875,
+"t": 481.2087097167969,
 "r": 295.8506164550781,
-"b": 251.1358642578125,
+"b": 251.135986328125,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -17255,9 +17255,9 @@
 "page_no": 4,
 "bbox": {
 "l": 98.93103790283203,
-"t": 654.5244903564453,
+"t": 654.5245208740234,
 "r": 512.579833984375,
-"b": 497.91845703125,
+"b": 497.91851806640625,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -23506,7 +23506,7 @@
 "page_no": 6,
 "bbox": {
 "l": 62.02753829956055,
-"t": 596.3199462890625,
+"t": 596.3199310302734,
 "r": 285.78955078125,
 "b": 440.3381042480469,
 "coord_origin": "BOTTOMLEFT"
@@ -26668,9 +26668,9 @@
 {
 "page_no": 7,
 "bbox": {
-"l": 80.35527038574219,
+"l": 80.35525512695312,
 "t": 641.0637054443359,
-"r": 267.00823974609375,
+"r": 267.0082092285156,
 "b": 496.5545349121094,
 "coord_origin": "BOTTOMLEFT"
 },
@@ -31588,10 +31588,10 @@
 {
 "page_no": 8,
 "bbox": {
-"l": 72.65901947021484,
-"t": 619.5191650390625,
-"r": 274.8346862792969,
-"b": 452.14599609375,
+"l": 72.6590347290039,
+"t": 619.5191955566406,
+"r": 274.83465576171875,
+"b": 452.1459655761719,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [