feat(ocr): auto-detect rotated pages in Tesseract (#1167)

* fix(ocr): support mis-oriented documents in Tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing Tesseract errors that cause orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`
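
The bounding box rotation mentioned above can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (left, top, right, bottom) with a top-left origin and rotations in counterclockwise quarter turns. The name `rotate_box_ccw` and its signature are hypothetical and are not taken from `orientation.py`.

```python
from typing import Tuple

# Hypothetical types for illustration only (not from the commit).
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), top-left origin
Size = Tuple[float, float]               # (width, height)


def rotate_box_ccw(box: Box, size: Size, quarter_turns: int) -> Tuple[Box, Size]:
    """Map a box into the frame of an image rotated counterclockwise.

    Each quarter turn sends a point (x, y) of a w-by-h image to (y, w - x)
    and swaps the image width and height.
    """
    l, t, r, b = box
    w, h = size
    for _ in range(quarter_turns % 4):
        # Rotate two opposite corners, then re-normalize so that
        # left <= right and top <= bottom in the new frame.
        x0, y0 = t, w - l
        x1, y1 = b, w - r
        l, r = min(x0, x1), max(x0, x1)
        t, b = min(y0, y1), max(y0, y1)
        w, h = h, w
    return (l, t, r, b), (w, h)
```

Applying the same function with `(4 - k) % 4` quarter turns maps a box detected on the rotated image back into the original page frame, which is one plausible way to keep OCR cell coordinates consistent after rotating a page to its natural orientation.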

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
Clément Doumouro
2025-05-21 18:12:33 +02:00
committed by GitHub
parent 90875247e5
commit 45265bf8b1
96 changed files with 9864 additions and 5258 deletions


@@ -14942,9 +14942,9 @@
 "page_no": 2,
 "bbox": {
 "l": 148.45364379882812,
-"t": 583.6257476806641,
+"t": 583.6257629394531,
 "r": 464.3608093261719,
-"b": 366.1538391113281,
+"b": 366.1537780761719,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -15221,9 +15221,9 @@
 {
 "page_no": 7,
 "bbox": {
-"l": 164.6503143310547,
+"l": 164.65028381347656,
 "t": 628.2029113769531,
-"r": 449.550537109375,
+"r": 449.5505676269531,
 "b": 511.6590576171875,
 "coord_origin": "BOTTOMLEFT"
 },
@@ -15475,7 +15475,7 @@
 {
 "page_no": 8,
 "bbox": {
-"l": 140.70960998535156,
+"l": 140.70968627929688,
 "t": 283.9361572265625,
 "r": 472.73382568359375,
 "b": 198.32281494140625,
@@ -15804,10 +15804,10 @@
 {
 "page_no": 10,
 "bbox": {
-"l": 162.67434692382812,
-"t": 347.3774719238281,
-"r": 451.70068359375,
-"b": 128.786376953125,
+"l": 162.67430114746094,
+"t": 347.37744140625,
+"r": 451.70062255859375,
+"b": 128.78643798828125,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -15875,9 +15875,9 @@
 {
 "page_no": 11,
 "bbox": {
-"l": 168.3928985595703,
+"l": 168.39285278320312,
 "t": 610.0334930419922,
-"r": 447.3513488769531,
+"r": 447.35137939453125,
 "b": 157.99432373046875,
 "coord_origin": "BOTTOMLEFT"
 },
@@ -17702,7 +17702,7 @@
 "page_no": 10,
 "bbox": {
 "l": 143.6376495361328,
-"t": 635.6522827148438,
+"t": 635.6522979736328,
 "r": 470.8485412597656,
 "b": 528.7375183105469,
 "coord_origin": "BOTTOMLEFT"