- Published on
 
Tesseract and Python on Fedora
- Authors
 
- Name
 - Martin Andrews
 - @mdda123
 
Fedora, while including a comprehensive tesseract set of rpms, doesn't have the equivalent of tesseract-python, so I needed something to build/import easily.
However, quick Google searches offer several solutions that now look inappropriate :
pip install tesseract appears to contain a complete, old
tesseractbuild - it's 40Mbpip install pytesseract is GPL3, which is inappropriate for my use-case
Using
tesseractdirectly via commandline embedding
However, reading the tesseract project's wiki pages on github indicate that there are several other choices available, and I (somewhat arbitrarily) chose tesserocr, which is MIT licensed, and has a fairly comprehensive API into the 'raw' tesseract C/C++ code.
Installation of tesserocr
sudo dnf install tesseract-devel
pip install tesserocr
Usage
The following also includes hints for interoperability with opencv using pillow (for PIL), which can be helpful in cleaning up the image prior to textextraction. It's useful to pre-clean, even though tesseract iteself does some cleaning, because there's often application-specific knowledge that can be used more effectively than the tesseract generic methods.
from tesserocr import PyTessBaseAPI, RIL, PSM
im = cleaned_view
from PIL import Image
im_pil = Image.fromarray(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))
plt.imshow(im_pil, 'gray')
plt.show()
#  PSM : https://github.com/sirfz/tesserocr/blob/3c699c8cff7a7c5552e7bf51d5631bbf95414c9c/tesseract.pxd#L214
with PyTessBaseAPI(psm=PSM.SINGLE_BLOCK) as ocr:
    ocr.SetImage(im_pil)
    boxes = ocr.GetComponentImages(RIL.TEXTLINE, True)
    print 'Found {} textline image components.'.format(len(boxes))
    for i, (im, box, _, _) in enumerate(boxes):
        # box is a dict with x, y, w and h keys
        ocr.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = ocr.GetUTF8Text()
        conf = ocr.MeanTextConf()
        print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
               "confidence: {1}, text: {2}").format(i, conf, ocrResult.replace('\n',''), **box)