Gentoo Wiki

OCR

Introduction

Optical Character Recognition (OCR), less precisely described as image-to-text conversion, is an almost barren field on Linux. The shortage of open-source solutions is surpassed only by the lack of commercial products.

Since OCR is not well developed in the open-source world and is covered only by a few, mostly experimental programs, this guide is nearly useless. One may hope that it evolves into something useful one day. OCR itself is done via some of the following steps:

While feature recognition has the huge advantage of not needing to be trained, it often does not achieve the results of a well-trained FFT comparison.

tesseract-ocr

By far the best, head and shoulders above gocr, xocr, ocrad, ocre, clara, ...:

http://code.google.com/p/tesseract-ocr

See also http://code.google.com/p/ocropus/

Installation

package.keywords can now be a directory, which makes it much easier to manage: simply put a file in it that lists the packages to keyword.

# echo "app-text/tesseract ~x86" >> /etc/portage/package.keywords/tesseract
# emerge tesseract 
[ebuild   R   ] app-text/tesseract-2.00  USE="tiff" LINGUAS="-de -en -es -fr -it -nl" 0 kB
*** You must select one of these LINGUAS variables, otherwise no dictionary/language information is downloaded! ***
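
The warning in the emerge output means that at least one language has to be enabled before the package is of any use. A minimal sketch, assuming English is wanted and that LINGUAS is set in make.conf (found at /etc/make.conf or /etc/portage/make.conf, depending on the Portage version); any other code from the list above could be used instead:

```shell
# Fragment for make.conf (assumed location: /etc/make.conf or /etc/portage/make.conf).
# "en" is only an example choice; use any code from the LINGUAS list that emerge prints.
LINGUAS="en"
```

Re-running emerge tesseract afterwards should show the chosen language without the leading minus sign.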

gocr

gocr is, as yet, the most advanced free OCR software for Linux, and one might find it worth a try. It uses feature analysis, so no training is needed. The latest version has crude FFT comparison database capabilities, which are switched off by default. As of version 0.40 there is no trigram or dictionary comparison in gocr.

Installation

emerge gocr

or just copy the gocr binary anywhere in your PATH.

Usage

gocr filename.pnm

will analyze the image and print the recognized text. It should even be able to recognize barcodes, though I have not tested that myself.
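
To process more than one image, the call above can be wrapped in a small loop. The following is only a sketch: ocr_dir is a made-up helper name, and it assumes that the images are already in PNM format and that the text files should be written next to them. It uses only the -i and -o options listed under "Man page" below.

```shell
# Hypothetical helper: run gocr on every .pnm file in a directory
# and write the recognized text to a .txt file with the same base name.
ocr_dir() {
    dir="${1:-.}"                          # default to the current directory
    for img in "$dir"/*.pnm; do
        [ -e "$img" ] || continue          # skip if the glob matched nothing
        gocr -i "$img" -o "${img%.pnm}.txt"
    done
}
```

Calling ocr_dir with a directory of scans would then leave one .txt file beside each .pnm file, for example page1.txt next to page1.pnm.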

Man page

gocr [options] pnm_file_name  # use - for stdin
options:
-h        - get this help
-i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
-i -      - read PNM from stdin (djpeg -gray a.jpg | gocr -)
-o name   - output file  (redirection of stdout)
-e name   - logging file (redirection of stderr)
-x name   - progress output (file or fifo)
-p name   - database path (including final slash, default is ./db/)
-f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num    - threshold grey level 0<160<=255 (0 = autodetect)
-d num    - dust_size (remove smaller clusters, -1 = autodetect)
-s num    - spacewidth/dots (0 = autodetect)
-v num    - verbose  [summed]
     1      print more info
     2      list shapes  of boxes (see -c)
     4      list pattern of boxes (see -c)
     8      print pattern after recognition
    16      print line infos
    32      debug outXX.pgm
-c string - list of chars (_ = not recognized chars, debug)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num    - operation modes, ~ = switch off
     2      use database (early development)
     4      layout analysis, zoning (development)
     8      ~ compare non recognized chars
    16      ~ divide overlapping chars
    32      ~ context correction
    64      char packing (development)
   130      extend database, prompts user (128+2, early development)
   256      switch off the OCR engine (makes sense together with -m 2)
-n   1      only numbers
examples:
       gocr -v 33 text1.pbm                # some infos + out30.bmp
       gocr -v 7 -c _YV text1.pbm          # list unknown, Y and V chars
       djpeg -pnm -gray text.jpg | gocr -  # use jpeg-file via pipe
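
The values for -v and -m are flags that are added together, which is why the first example passes 33 (1 + 32) and why the database-extension mode is listed as 130 (128 + 2). A small sketch of that arithmetic; sum_flags is a hypothetical helper and not part of gocr:

```shell
# Hypothetical helper: add up the individual flag values for -v or -m.
sum_flags() {
    total=0
    for flag in "$@"; do
        total=$((total + flag))
    done
    echo "$total"
}

# "print more info" (1) plus "debug outXX.pgm" (32)
echo "gocr -v $(sum_flags 1 32) text1.pbm"     # prints: gocr -v 33 text1.pbm

# "use database" (2) plus the extend-database bit (128)
echo "gocr -m $(sum_flags 128 2) text1.pbm"    # prints: gocr -m 130 text1.pbm
```

The commands are only echoed here so that the resulting option values can be checked before gocr is actually run.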

Others

Other OCR engines are Clara OCR (one of the few trainable programs, which has been undergoing a rewrite since the end of 2003), ocre, and GNU ocrad. xplab and gamera are experimental and are not real OCR programs themselves, but frameworks meant to aid in building OCR engines later.

Of these, ocre seems to be the most developed, with ocrad close behind. None of them currently features trigram or dictionary comparison.

ABBYY has released the FineReader OCR engine for Linux. Since it is both closed source and relatively expensive, I have not gotten my hands on it yet.

Vividata provides Optical Character Recognition and Image Processing software for Linux and UNIX environments for commercial usage, high-volume applications, and customized applications.

Retrieved from "http://www.gentoo-wiki.info/OCR"

Last modified: Wed, 01 Oct 2008 07:07:00 +0000 Hits: 11,664