色々と考えてみる: "OCR on Linux", for my own use

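There is an OCR library called tesseract-ocr that also runs on Linux.
I'll skip the detailed explanation. For English text it works quite well.
In some cases its accuracy is even better than that of commercial products.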

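The library basically runs from the command line,
which makes it handy for batch processing. This time I wrote the batch part in Python.
It ran reasonably well, so I'm posting it as a success story.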

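What follows is extremely sloppy code that I wrote because I had to explain it to someone.
This time it really is just a memo. No guarantee that it works. Someone might find it useful, though.
Well, that's probably how it always is...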

The source code can be downloaded from the following link.
https://docs.google.com/open?id=0B9OtIs5BjRVha21YQXlHSGZUTWc

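Usage is simple. Type the following in a console (terminal).
If you are not running it from the directory that contains "cis.ocr.py", give the script's full path as well.
After the "-d" argument, specify the directory containing the images you want to OCR.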

python cis.ocr.py -d 'path/to/image/directory'
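
The script itself parses "-d" with getopt. For comparison, here is a minimal sketch of the same interface written with argparse from the standard library. This is an alternative, not what cis.ocr.py does, and the function name parse_args is made up for this illustration.

```python
import argparse


def parse_args(argv):
    """Parse the command line; -d/--idir names the image directory."""
    parser = argparse.ArgumentParser(prog="cis.ocr.py")
    # required=True makes argparse print a usage message and exit
    # when the directory is missing, so no manual check is needed.
    parser.add_argument("-d", "--idir", required=True,
                        help="directory containing the images to OCR")
    return parser.parse_args(argv)
```

Calling parse_args(sys.argv[1:]) would then give an object whose idir attribute holds the directory, and "-h" is handled automatically.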

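As the code itself shows,
the version below only supports bitmap (.bmp) images.
With a small change it can handle JPEG or TIFF as well; in that case, modify the part that checks the file extension.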
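
As a rough sketch of that change, the extension check could compare against a set of suffixes instead of the single ".bmp" string. The names SUPPORTED_EXTENSIONS and is_supported_image are made up for this illustration and do not appear in the script. Which formats can actually be read depends on your tesseract installation, so treat the set as an assumption to adjust.

```python
import os

# Hypothetical set of suffixes to accept; extend or trim as needed.
SUPPORTED_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported_image(path):
    """Return True if the file name ends with one of the accepted suffixes."""
    # Lower-case the suffix so that "PAGE.TIFF" and "page.tiff" match alike.
    extension = os.path.splitext(path)[1].lower()
    return extension in SUPPORTED_EXTENSIONS
```

In main, the line `elif extension == ".bmp":` would then become `elif is_supported_image(fullpath):`.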

Hmm... this code really does feel copy-pasted.
Of course, the copy-paste source is my own code.
Next time I revisit it, I'll try to tidy it up a bit.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Import modules (only the standard-library modules this script actually uses)
import getopt
import os
import shlex
import subprocess
import sys

def txtCreansing(file):
    '''
    Cleansing OCR text. Converting some unreadable characters into readable characters.
    @param file: File path for cleansing.
    @type file: C{string}
    @returns: Cleaned text with string type.
    @rtype: C{string}
    '''
    # Read the tesseract output as UTF-8 and delete ASCII hyphens,
    # newlines and backslashes before doing any substitution.
    with open(file, encoding="utf-8") as f:
        text = f.read().translate(str.maketrans("", "", "-\n\\"))
    # (old, new) pairs, applied in this order.
    replacements = [
        # Hyphens, dashes and the horizontal bar.
        ("\u2010", "-"), ("\u2011", "-"), ("\u2012", "-"),
        ("\u2013", "-"), ("\u2014", "-"), ("\u2015", "-"),
        # Double vertical line.
        ("\u2016", "ll"),
        # Typographic single and double quotation marks.
        ("\u2018", "'"), ("\u2019", "'"), ("\u201a", ","), ("\u201b", "'"),
        ("\u201c", '"'), ("\u201d", '"'), ("\u201e", ",,"), ("\u201f", '"'),
        # Dot leaders and the ellipsis.
        ("\u2024", "."), ("\u2025", ".."), ("\u2026", "..."),
        # Latin ligatures.
        ("\ufb01", "fi"), ("\ufb02", "fl"), ("\ufb03", "ffi"),
        ("\ufb04", "ffl"), ("\ufb05", "ft"), ("\ufb06", "st"),
        # Hebrew presentation forms, presumably misreads of "!".
        ("\ufb1d", "!"), ("\ufb1f", "!!"),
        # e with acute accent, degree sign, cent sign.
        ("\u00e9", "e"), ("\u00b0", "'"), ("\u00a2", "c"),
        ("``", '"'),
        # Put spaces around punctuation.
        (".", " . "), (",", ", "),
        # Presumably a double space in the original; the whitespace was
        # collapsed when the code was posted.
        ("  ", " "),
        ("'r", "t"),
        # Escape the remaining quotation marks.
        ("'", "\\'"), ('"', '\\"'),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

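def getTextFromImage(path):
    '''
    OCR by using tesseract-OCR.
    '''
    # Run tesseract only if there is no output from an earlier run.
    # tesseract appends ".txt" to the output base name by itself.
    if not os.path.exists(path + ".txt"):
        # shlex.quote keeps paths containing spaces or quotes intact.
        args = "tesseract %s %s" % (shlex.quote(path), shlex.quote(path))
        subprocess.call(shlex.split(args), shell=False)
    return txtCreansing(path + ".txt")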

USAGE = 'cis.ocr.py -d <input directory>'

def main(argv):
    indir = ''
    try:
        opts, args = getopt.getopt(argv, "hd:", ["idir="])
    except getopt.GetoptError:
        print(USAGE)
        sys.exit(2)
    if len(opts) == 0:
        # Also covers an empty command line.
        print('Input directory is not specified:')
        print(USAGE)
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print(USAGE)
            sys.exit()
        elif opt in ("-d", "--idir"):
            indir = arg

    for child in os.listdir(indir):
        # Get the full path of each entry in the input directory.
        fullpath = os.path.join(indir, child)
        extension = os.path.splitext(fullpath)[1].lower()
        if os.path.isdir(fullpath):
            # The original called getXmlDir(fullpath) here, but that function
            # is not defined in this listing, so subdirectories are skipped.
            continue
        elif extension == ".bmp":
            getTextFromImage(fullpath)

if __name__ == "__main__":
    main(sys.argv[1:])


2012/11/26
