Docling/ocr_utils.py at 1d680b0a321d95fc6bd65b7bb4d5e15005a0250a

NeoAnd/Docling

Nikos Livathinos 3be2fb581f

feat: Introduce automatic language detection in TesseractOcrCliModel (#800)

* feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Add an example of how to use the "auto" language with Tesseract OCR engines

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Refactor TesseractOcrModel and TesseractOcrCliModel to validate that the auto-detected
language is installed on the system and, if not, fall back to a default option without a language.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

2025-01-26 08:07:56 +01:00

10 lines

263 B

Python


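def map_tesseract_script(script: str) -> str:
    r"""Map a detected script name to the script name used for Tesseract's models.

    Names without a mapping are returned unchanged.
    """
    # Both Japanese syllabaries share the "Japanese" script model.
    if script in ("Katakana", "Hiragana"):
        script = "Japanese"
    # "Han" is mapped to the Simplified Han variant, "HanS".
    elif script == "Han":
        script = "HanS"
    elif script == "Korean":
        script = "Hangul"
    return script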
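
The commit message above describes validating that an auto-detected language is installed and falling back when it is not. The sketch below shows one way the mapping could feed such a check. It is only an illustration: `build_lang_arg`, the `installed` set, and the `"script/"` prefix are hypothetical names and values chosen for this example, not code taken from the repository, and the mapping function is repeated so the block runs on its own.

```python
from typing import Optional, Set


def map_tesseract_script(script: str) -> str:
    # Copy of the mapping from ocr_utils.py so this example is self-contained.
    if script in ("Katakana", "Hiragana"):
        script = "Japanese"
    elif script == "Han":
        script = "HanS"
    elif script == "Korean":
        script = "Hangul"
    return script


def build_lang_arg(
    detected_script: str,
    installed: Set[str],
    prefix: str = "script/",  # assumption: prefix under which script models are listed
) -> Optional[str]:
    """Hypothetical helper: return a language argument, or None to fall back.

    Returning None stands in for "run OCR without an explicit language".
    """
    candidate = f"{prefix}{map_tesseract_script(detected_script)}"
    return candidate if candidate in installed else None


if __name__ == "__main__":
    # Assumed list of installed models, for illustration only.
    installed_models = {"eng", "script/Japanese", "script/Latin"}
    print(build_lang_arg("Katakana", installed_models))
    print(build_lang_arg("Han", installed_models))
```

With the assumed set of installed models, a detected "Katakana" script resolves to an installed model, while "Han" maps to "HanS", which is not in the set, so the helper signals a fallback by returning `None`.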