Clara OCR Tutorial

Welcome. Clara OCR is a free OCR program, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.

This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Tutorial". There is also an advanced manual known as "The Clara OCR Advanced User's Manual" (man page clara-adv(1), also available in HTML format). Developers must read "The Clara OCR Developer's Guide" (man page clara-dev(1), also available in HTML format).

CONTENTS

1. Making OCR
2. AVAILABILITY
3. CREDITS

1. Making OCR

This section is a tutorial on the basic OCR features offered by Clara OCR. Clara OCR is not simple to use. A basic knowledge of how it works is required for using it. The most complex features are not covered by this tutorial. If you need to compile Clara from the source code, read the INSTALL file and check (if necessary) the compilation hints in the Clara OCR Advanced User's Manual.

1.1 Starting Clara

So let's try it. Of course we need a scanned page to do so. Clara OCR requires the PBM or PGM graphic format (other formats, such as TIFF, must be converted; the netpbm package contains various conversion tools). The Clara distribution package contains one small PBM file that you can use for a first test. The name of this file is imre.pbm. If you cannot locate it, download it or other files from http://www.claraocr.org/. Alternatively, you can produce your own 600-dpi PBM or PGM files by scanning any printed document (hints for scanning pages and converting them to PBM are given in the section "Scanning books" of the Clara OCR Advanced User's Manual).
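Before loading a file, it may help to check that it really is a PBM or PGM image. The sketch below is not part of Clara OCR. It only inspects the two-byte magic number used by the netpbm formats (P1 and P4 for PBM, P2 and P5 for PGM). The names netpbm_kind and file_kind are made up for this illustration, and this tutorial does not say whether Clara accepts both the plain and the raw variants.

```python
# Hypothetical helper, not part of Clara OCR: identify a netpbm file
# by its magic number (the first two bytes of the file).
NETPBM_MAGIC = {
    b"P1": "PBM (plain)",
    b"P4": "PBM (raw)",
    b"P2": "PGM (plain)",
    b"P5": "PGM (raw)",
}


def netpbm_kind(header):
    """Return a description of the format, or None if not PBM/PGM."""
    return NETPBM_MAGIC.get(header[:2])


def file_kind(path):
    """Read the first two bytes of a file and classify them."""
    with open(path, "rb") as f:
        return netpbm_kind(f.read(2))
```

For instance, file_kind("imre.pbm") should return one of the two PBM descriptions if the file is intact. Any other result suggests the file needs conversion first.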

Once you have a PBM or PGM file to try, cd to the directory where the file resides and fire up Clara. Example:

$ cd /tmp/clara
$ clara &

To run OCR tests, Clara needs to write files in that directory, so write permission is required, as is some free disk space.
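The two requirements above (write permission and free space) can be checked before starting. This is only a sketch using the Python standard library. The function name workdir_ready and the 10 MB default are arbitrary assumptions, since this tutorial does not state how much space Clara needs.

```python
import os
import shutil


def workdir_ready(path, min_free_bytes=10 * 1024 * 1024):
    """Return True if path is a writable directory with some free space.

    The 10 MB default is an arbitrary assumption, not a Clara requirement.
    """
    if not os.path.isdir(path):
        return False
    # Creating files in a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        return False
    return shutil.disk_usage(path).free >= min_free_bytes
```

Calling workdir_ready("/tmp/clara") before launching Clara gives an early warning instead of a failure in the middle of a session.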

Obs. As of version 0.9.9, the Clara OCR heuristics are tuned to handle 600 dpi bitmaps. When using a different resolution, specify it using the -y switch:

$ clara -y 300 &

Then a window with menus and buttons will appear on your X display:

+-----------------------------------------------+
| File Edit OCR ...                             |
+-----------------------------------------------+
| +--------+   +----+ +--------+ +-------+      |
| |  zoom  |   |page| |patterns| | tune  |      |
| +--------+ +-+    +-+        +-+       +-+    |
| +--------+ | +-------------------------+ |    |
| |  zone  | | |                         | |    |
| +--------+ | |                         | |    |
| +--------+ | |                         | |    |
| |  OCR   | | |       WELCOME TO        | |    |
| +--------+ | |                         | |    |
| +--------+ | |    C L A R A   O C R    | |    |
| |  stop  | | |                         | |    |
| +--------+ | |                         | |    |
|     .      | |                         | |    |
|     .      | |                         | |    |
|            | |                         | |    |
|            | |                         | |    |
|            | +-------------------------+ |    |
|            +-----------------------------+    |
|                                               |
| (status line)                                 |
+-----------------------------------------------+

Welcome aboard! The rectangle with the welcome message is called "the plate". As you have already guessed, the small rectangles with the labels "zoom", "OCR", "stop", etc., are "the buttons". The "tabs" are those flaps labelled "page", "patterns" and "tune". On the menu bar you'll find the File menu, the Edit menu, and so on. Pop up the "Options" menu and change the current font size for better visualization, if required.

Press "L" to read the GPL, or select the "page" tab and then select, on the plate, the imre.pbm page (or any other PBM or PGM file listed there). The OCR will load that file, showing the progress of this operation on the status line at the bottom of the window.

Note: the "page" tab is the flap labelled "page". It is unrelated to the "tab" key.

When the load operation completes, Clara will display the page. Press the OCR button and wait a bit. The letters will become grayed and the plate will split into three windows. Move the pointer along the plate and you'll see the tab label follow the current window: "page", "page (output)" or "page (symbol)". Move the pointer along the entire application window and, for most components (the buttons, for instance), you'll see a short context help message on the status line when the pointer reaches them. Dialogs (user confirmations) also use the status line (like Emacs), instead of dialog boxes.

You can resize either the Clara application window or each of the three windows currently on the plate ("page", "page (output)" and "page (symbol)"). To resize the windows, select any point between two of them and drag the mouse. The scrollbars can be hidden (use the "hide scrollbars" item on the View menu).

When the tab label is "page", press the "zoom" button using mouse button 1 and the scanned image will zoom out. If you use mouse button 3, the image will zoom in (the behaviour of the "zoom" button depends on the current window).

Now try selecting the "page" tab several times, and you will cycle through the various display modes shared by this tab. These modes are referred to as "PAGE", "PAGE (fatbits)" and "PAGE (list)". Each display mode may have one or more windows. We've chosen this uncommon approach because an excess of tabs turns them into useless decoration. The other tabs also offer various modes; some will be presented later in this tutorial.

1.2 A few command-line switches

Besides the -y option used in the last subsection, Clara accepts many others, documented in the Clara OCR Advanced User's Manual. For now, of the various ways to start Clara, we'll limit ourselves to a few examples:

clara
clara -h

In the first case, Clara is just started. In the second, it will display a short help message and exit.

clara -f path
clara -f path -w workdir

The option -f specifies the relative or absolute path of a scanned page or of a directory with scanned pages (PBM or PGM files). The option -w specifies the relative or absolute path of a work directory (where Clara will create the output and data files).

clara -i -f path -w workdir
clara -b -f path -w workdir

The option -i activates dead-key emulation for the composition of accents and characters. The -b switch is for batch processing: Clara will automatically perform one OCR run on the file given through -f (or on all files found, if it is the path of a directory) and exit without displaying its window.
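Batch runs are convenient to drive from a script. The sketch below builds the argument list for a batch run using only the switches described in this tutorial (-b, -f, -w and -y). The function names are made up for illustration, a "clara" executable is assumed to be on the PATH, and combining -y with -b is an assumption that this tutorial does not confirm.

```python
import subprocess


def clara_batch_argv(path, workdir, dpi=None):
    """Build the argument list for a batch run, using only -b, -f, -w, -y."""
    argv = ["clara", "-b", "-f", path, "-w", workdir]
    if dpi is not None:
        # Assumption: -y may be combined with -b.
        argv += ["-y", str(dpi)]
    return argv


def run_batch(path, workdir, dpi=None):
    """Run Clara in batch mode; assumes a 'clara' executable is on PATH."""
    return subprocess.run(clara_batch_argv(path, workdir, dpi), check=True)
```

Keeping the argument list in a separate function makes it easy to print the command before running it.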

clara -Z 1 -F 7x13

Clara will start with the smallest possible window size.

A full reference of command-line switches is given in the section "Reference of command-line switches" of the Clara OCR Advanced User's Manual.

1.3 Training symbols

Yes, Clara OCR must be trained. Training is a tedious procedure, but it's a must for those who need a customizable OCR, able to adapt to a possibly uncommon printing style.

Before training, a process called segmentation must be performed. Press the right mouse button over the OCR button, select "Segmentation" on the menu that pops up, and wait for the operation to complete.

Now, on the "page" tab, observe the image of the document presented on the top window. You'll see the symbols greyed, because the OCR currently does not know their transliterations. Try to select one symbol using the mouse (click the mouse button 1 over it). A black elliptic cursor will appear around that symbol. This cursor is called the "graphic cursor". You can move the graphic cursor around the document using the arrow keys.

Now observe the bottom window on the "page" tab. That window presents some detailed information on the current symbol (the one identified by the graphic cursor). When the "show web clip" option on the "View" menu is selected, a clip of the document around the current symbol is displayed too. In some cases, this clip is useful for better visualization. It is called "web clip" because this same image is exported to the Clara OCR web interface when cooperative training and revision through the Internet is being performed.

To inform the OCR about the transliteration of one symbol, just type the corresponding key. For instance, if the current symbol is a letter "a", just type the "a" key. Observe that the trained symbol becomes black. Each trained symbol is learned by the OCR: its bitmap is called a "pattern", and it will be used as such when trying to deduce the transliteration of unknown symbols.

Obs. In our test, the user chose the symbol to be trained. However, Clara OCR can choose by itself the symbols to be trained. This feature is called "build the bookfont automatically" (found on the "tune" tab). To use it, select the corresponding checkbox and classify the symbols as explained later.

Finally, when the transliteration cannot be entered through one single keystroke or composition (for instance, when you wish to give a TeX macro as the transliteration of the current symbol), write down the transliteration using the text input field on the bottom window (select it with the mouse first).

1.4 Saving the session

Before going further, it's important to know how to save your work. The File menu contains one item labelled "save session". When selected, it will create or overwrite three files in the working directory: "patterns", "acts" and "page.session", where "page" is the name of the file currently loaded, without the "pbm" or "pgm" extension (in our example, "imre"). So, to remove all data produced by OCR sessions, manually remove the files "*.session", "patterns" and "acts".

Note that the files "patterns" and "acts" are shared by all PBM or PGM pages, so a symbol trained on one page is reused on the other pages. The ".session" files, however, are per-page. Pages with the same graphic characteristics, and only those, must be put in the same directory, in order to share the same patterns.
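The naming rule above can be written down in a few lines. This sketch only restates the rule from the text (two shared files plus one ".session" file per page). The function name session_file_names is made up for illustration.

```python
import os


def session_file_names(page_file):
    """Files written by "save session" for one page, per the text above."""
    base = os.path.basename(page_file)
    stem, ext = os.path.splitext(base)
    if ext.lower() not in (".pbm", ".pgm"):
        raise ValueError("expected a .pbm or .pgm file: %r" % page_file)
    # "patterns" and "acts" are shared; the .session file is per-page.
    return ["patterns", "acts", stem + ".session"]
```

For the example page, session_file_names("imre.pbm") lists "patterns", "acts" and "imre.session", which are the files to remove when starting over.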

When the "quit" option of the "File" menu is selected, the OCR prompts the user to save the session (answer by pressing the "y" or "n" key), unless there are no unsaved changes.

1.5 OCR steps

The OCR process is divided into various steps, for instance "classification", "build", etc. These steps are accessible by clicking mouse button 3 over the OCR button. Each one can be started independently and/or repeated at any moment. In fact, the more you know about these steps, the better you'll use them.

When the "OCR" button is clicked with mouse button 1, all steps are started in sequence. The "OCR" button remains in the "selected" state while some step is running.

Although we won't cover this in the tutorial, a basic knowledge of what each step performs is required for fine-tuning Clara OCR. Tuning is an interactive effort in which the usage of the heuristics alternates with training and revision, guided by the user's experience and feeling.

1.6 Classification

After training some symbols, we're ready to apply the just acquired knowledge to deduce the transliteration of non-trained symbols. For that, Clara OCR will compare the non-trained symbols with those trained ("patterns"). Clara OCR offers nice visual modes to present the comparison of each symbol with each pattern. To activate the visual modes, enter the View menu and select (for instance) the "show comparisons" option.

Now start the "classification" step (click the mouse button 3 over the OCR button and select the "classification" item) and observe what happens. Depending on your hardware and on the size of the document, this operation may take long to complete (e.g. 5 minutes). Hopefully it'll be much faster (say, 30 seconds).

When the classification finishes, observe that some non-trained symbols became black. Each such symbol was found similar to some pattern. Select one black symbol, and Clara will draw a gray ellipse around each class member (except the selected symbol, identified by the black graphic cursor). You can switch off this feature by unselecting the "Show current class" item on the "View" menu.

In some cases, Clara will classify some symbols incorrectly. For instance, a defective "e" may be classified as "c". If that happens, you can inform Clara about the correct transliteration of that symbol by training it as explained before (in this example, select the symbol and press "e"). This action will remove that symbol from its current class and define a new class, currently unitary and containing just that symbol.
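To make the idea of comparing symbols with patterns concrete, here is a deliberately naive sketch. It is not Clara's classifier: it simply counts differing pixels between equal-sized bitmaps and accepts the closest pattern within a threshold. The bitmap representation (tuples of strings), the names pixel_diff and classify, and the max_diff parameter are all assumptions made for this illustration.

```python
# Toy 3x5 bitmaps; "X" is ink, "." is background.
PATTERN_C = ("XXX", "X..", "X..", "X..", "XXX")
PATTERN_E = ("XXX", "X..", "XX.", "X..", "XXX")


def pixel_diff(a, b):
    """Number of differing pixels, or None if the sizes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(symbol, patterns, max_diff=1):
    """Return the transliteration of the closest pattern, or None."""
    best = None
    for bitmap, translit in patterns:
        d = pixel_diff(symbol, bitmap)
        if d is not None and d <= max_diff and (best is None or d < best[0]):
            best = (d, translit)
    return best[1] if best else None


patterns = [(PATTERN_C, "c"), (PATTERN_E, "e")]
# An "e" that lost its middle stroke is pixel-identical to the "c" pattern,
# so this toy classifier calls it "c".
defective_e = ("XXX", "X..", "X..", "X..", "XXX")
```

Here classify(defective_e, patterns) yields "c", which mirrors the misclassification described above. Adding defective_e to the patterns with transliteration "e" would not help in this toy case because the bitmaps are identical; with real scans the defective symbol usually keeps some distinguishing pixels.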

1.7 Note about how Clara OCR classification works

The usual meaning of "classification" for OCRs is to deduce, for each symbol, whether it is a letter "a" or a letter "b", or a digit "1", etc. As the total number of different symbols is small (some tens), there will be a small number of classes.

However, instead of classifying each symbol as being the letter "a", or the digit "1", or whatever, Clara OCR builds classes of symbols with similar shapes, not necessarily assigning a transliteration to each symbol. So, as the bitmap comparison heuristics sometimes consider two true letters "a" dissimilar (due to printing differences or defects), the Clara OCR classifier will break the set of all letters "a" into various untransliterated subclasses.

Therefore, the classification result may be a much larger number of classes (thousands or more), not only because of those small differences or defects, but also because the classification heuristics are currently unable to scale symbols or to "boldify" or "italicize" a symbol.
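The effect described above, where one letter ends up split into several subclasses, can be reproduced with a toy grouping procedure. This is a sketch under strong assumptions, not Clara's algorithm: bitmaps are tuples of strings, similarity is a plain pixel count, and each class is represented by its first member. The names same_shape and build_classes are made up.

```python
def same_shape(a, b, max_diff=0):
    """True if two equal-sized bitmaps differ in at most max_diff pixels."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        return False
    diff = sum(p != q for x, y in zip(a, b) for p, q in zip(x, y))
    return diff <= max_diff


def build_classes(symbols, max_diff=0):
    """Greedily group symbols; each class is a list of symbol indexes."""
    classes = []
    for i, sym in enumerate(symbols):
        for members in classes:
            # Compare against the first member, taken as representative.
            if same_shape(symbols[members[0]], sym, max_diff):
                members.append(i)
                break
        else:
            # No existing class is similar enough: start a new one.
            classes.append([i])
    return classes


# Two clean copies of one shape, one defective copy, one different shape.
clean = ("XX", "XX")
defective = ("XX", "X.")
other = ("X.", ".X")
symbols = [clean, clean, defective, other]
```

With max_diff=0 the defective copy forms its own subclass, giving three classes for two real shapes. Loosening the threshold to max_diff=1 merges it back. No transliteration is involved at this stage, which matches the description above: training one member of a class is what gives the whole class its transliteration.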

Note that each untransliterated subclass of letters "a" depends on a specific human revision effort to become transliterated (trained). This is not an absurd strategy, because the revision of each subclass corresponds to part of the unavoidable human revision effort required by any real-life digitization project. This is one of the principles that make it possible to see Clara OCR not as a traditional OCR, but as a productivity tool able to reduce costs. Anyway, we expect further improvements to the Clara OCR classifier in the future, in order to lessen the number of subclasses created.

1.8 Building the output

Now we're ready to build the OCR output. Just start the "build" step. The action performed will be basically to detect text words and lines, and output the transliterations, trained or deduced, of all symbols. The output will be presented on the "PAGE (output)" window.

Each character on the "PAGE (output)" window behaves like an HTML hyperlink. Click it to select the current symbol both on the "PAGE" window and on the "PAGE (symbol)" window. Note that the transliteration of unknown symbols is replaced by their internal IDs (for instance "[133]").
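The substitution of unknown symbols by their IDs can be pictured with a small sketch. It assumes that words have already been detected and that each symbol is a pair (internal ID, transliteration or None). This data layout and the name build_output are hypothetical, chosen only to illustrate the rule.

```python
def build_output(words):
    """Join words of (symbol_id, transliteration) pairs into one text line.

    A transliteration of None marks an unknown symbol; it is rendered as
    its ID in square brackets, e.g. "[133]".
    """
    rendered = []
    for word in words:
        rendered.append("".join(
            t if t is not None else "[%d]" % sid for sid, t in word))
    return " ".join(rendered)


line = [
    [(131, "t"), (132, "h"), (133, None)],
    [(134, "c"), (135, "a"), (136, "t")],
]
```

Here build_output(line) gives "th[133] cat": symbol 133 has no transliteration yet, so its ID stands in for it until the symbol is trained or classified.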

The result of the word detection heuristic can be visualized by checking the "show words" item on the "View" menu.

1.9 Handling broken symbols

Obs. As of version 0.9.9, the merging heuristics are only partially implemented, and in most cases they won't produce any effect.

The build heuristics also try to merge the pieces of broken symbols, such as the "u", the "h" and the "E" in the figure (observe the absent pixels). Some letters have thin parts, and depending on the paper and printing quality, these parts will break more or less frequently.

[Figure: bitmaps of the letters "u", "h" and "E", each broken into pieces by absent pixels.]

Clara OCR offers three symbol merging heuristics: geometric-based, recognition-based and learned. Each one may be activated or deactivated using the "tune" tab.

Geometric merging applies to fragments on the interior of the symbol bounding box, like the "E" on the figure, and to some other cases too.
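The bounding-box test behind the geometric case can be sketched as follows. This is only an illustration of the "fragment inside the bounding box" idea, not Clara's implementation. Boxes are assumed to be (left, top, right, bottom) tuples in pixel coordinates, and the names inside and merge_interior_fragments are made up.

```python
def inside(inner, outer):
    """True if box inner = (left, top, right, bottom) lies within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def merge_interior_fragments(boxes):
    """Group each outermost box with the boxes lying inside it.

    Returns one list of indexes per outermost box, in input order.
    """
    def is_fragment(j):
        return any(i != j and boxes[i] != boxes[j] and inside(boxes[j], boxes[i])
                   for i in range(len(boxes)))

    roots = [i for i in range(len(boxes)) if not is_fragment(i)]
    groups = {r: [r] for r in roots}
    for j in range(len(boxes)):
        if j in groups:
            continue
        # Attach the fragment to the first outermost box containing it.
        for r in roots:
            if inside(boxes[j], boxes[r]):
                groups[r].append(j)
                break
    return [groups[r] for r in roots]


# A broken "E": the main body, a detached middle bar, and a neighbour symbol.
boxes = [(0, 0, 10, 20), (4, 9, 8, 11), (15, 0, 25, 20)]
```

For these boxes the result is [[0, 1], [2]]: the detached bar falls inside the box of the main body and is grouped with it, while the neighbouring symbol stays on its own.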

The recognition merging searches for unrecognized symbols and, for each one, tries to merge it with some neighbour(s), and checks if the result becomes similar to some pattern.

Finally, learned merging will try to reproduce the cases trained by the user. To train merging, just select the symbol using the mouse button 1 (say, the left part of the "u" on the figure), click the mouse button 3 on the fragment (the right part of the "u"), and select the "merge with current symbol" entry. On the other hand, the "disassemble" entry may be used to break a symbol into its components.

Obs. Do not merge the "i" dot with the "i" stem. See the subsection "Handling accents" for details.

1.10 Handling accents

Now let's talk about accents.

As a general rule, Clara OCR does not consider accents as parts of letters, so merging does not apply to accents. Accents are considered individual symbols, and must be trained separately. The "i" dot is handled as an accent. Clara OCR will compose accents with the corresponding letters when generating the output. The exception is when the accent is graphically joined to the letter:

[Figure: two bitmaps of the letter "e" with acute accent. In the first, the accent is separated from the letter; in the second, the accent touches the letter.]

In the figure we have two samples of the letter "e" with acute accent. In the first one, the accent is graphically separated from the letter. So the accent transliteration will be trained or deduced as being "'", and the letter transliteration will be trained or deduced as being "e". When generating the output, Clara OCR will compose them as the macro "\'e" (or as the ISO character 233, as soon as we provide this alternative behaviour).

In the second case the accent isn't graphically separable from the letter, so we'll need to train the accented character as the corresponding ISO character (code 233) or as the macro "\'e". As the generation of accented characters depends on the local X settings, the "Emulate deadkeys" item on the "Options" menu may be useful in this case. It enables the composition of accents and letters performed directly by Clara OCR (like the Emacs iso-accents-mode feature).
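The two output forms mentioned above (the macro "\'e" and the ISO character 233) can be illustrated with a short sketch. The composition table below lists only a few ISO 8859-1 codes and is not Clara's own. The name compose and the use_macro parameter are made up for this example.

```python
# Composition table: (accent, letter) -> ISO 8859-1 code. Only a few
# entries are listed; the table is illustrative, not Clara's own.
LATIN1 = {
    ("'", "e"): 233,  # e with acute accent
    ("'", "a"): 225,  # a with acute accent
    ("`", "a"): 224,  # a with grave accent
    ("^", "o"): 244,  # o with circumflex
}


def compose(accent, letter, use_macro=True):
    """Compose an accent and a letter as a TeX-style macro or a character."""
    macro = "\\" + accent + letter
    if use_macro:
        return macro
    code = LATIN1.get((accent, letter))
    # Fall back to the macro when the table has no entry for the pair.
    return chr(code) if code is not None else macro
```

With the defaults, compose("'", "e") returns the three-character macro \'e. With use_macro=False it returns the single character with code 233.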

1.11 Browsing the book font

As explained earlier, trained symbols become patterns (unless you mark them "bad"). The collection of all patterns is called the "book font" (the term "book" distinguishes it from the GUI font). Clara OCR stores all patterns in the "patterns" file in the work directory when the "save session" entry on the "File" menu is selected.

Clara OCR itself can choose the patterns and populate the book font. To do so, just select the "Build the font automatically" item on the "tune" tab, and classify the symbols.

To browse the patterns, click the "patterns" tab one or more times to enter the "PATTERN (list)" window. The "PATTERN (list)" mode displays the bitmap and the properties of each pattern in a (perhaps very long) form. Click the "zoom" button to adjust the size of the pattern bitmaps. Use the scrollbar or the Next (Page Down) or Previous (Page Up) keys to navigate. Use the sort options on the "Edit" menu to change the presentation order.

Now press the "patterns" tab again to reach the "PATTERN" window. It presents the "current" pattern with detailed properties. Try activating the "show web clip" option on the "View" menu to visualize the pattern context. The left and right arrows move to the previous and to the next patterns. To train the current pattern (the one exhibited on the "PATTERN" window), just press the key corresponding to its transliteration (Clara will automatically move to the next pattern) or fill in the input field. There is no need to press ENTER to submit the input field contents.

1.12 Useful hints

If the GUI becomes trashed or blank, press C-l to redraw it.

For now, the GUI does not support cut-and-paste. To save the contents of the "PAGE (list)" window to a file, use the "Write report" item on the "File" menu.

The "OCR" button will enter the "pressed" state in some unexpected situations, like during dialogs. This behaviour will be fixed soon.

The "STOP" button does not immediately stop the OCR operation in progress (e.g. classification). Clara OCR only stops the operation at "safe" points, where all data structures are consistent.
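A common way to obtain this kind of behaviour is to check a stop flag only between units of work. The sketch below shows that general technique. It is not taken from the Clara source, and the names run_step, process and stop_requested are hypothetical.

```python
import threading


def run_step(items, process, stop_requested):
    """Process items one by one, checking for a stop request in between.

    The check happens only between items, where (by assumption) all data
    structures are consistent. Returns the number of items processed.
    """
    done = 0
    for item in items:
        if stop_requested.is_set():
            break
        process(item)
        done += 1
    return done
```

A GUI thread would call stop_requested.set() when the button is pressed. The item being processed at that moment still runs to completion, which is why the stop is not immediate.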

The OCR output is automatically saved to the file page.html (or page.txt if the option -o was used), where "page" is the name of the currently loaded page, without the "pbm" or "pgm" extension. This file is created by the "generate output" step on the menu that appears when mouse button 3 is pressed over the OCR button.

Some OCR steps are currently unfinished and perform no action at all.

1.13 Fun codes

Clara OCR "fun codes" are similar to videogame "codes" (for those who have never heard about that, videogame "codes" are special sequences of mouse or key clicks that make your player invulnerable, or obtain maximum energy, or perform an unexpected action, etc).

The difference is that Clara OCR "fun codes" are not secret (videogame "codes" are normally secret and very hard to discover by chance). Clara OCR contains no secret features. Fun codes are intended to be used during public presentations. For now there is only one fun code: just click one or more times on the banner on the welcome window to make it scroll.

2. AVAILABILITY

Clara OCR is free software. Its source code is distributed under the terms of the GNU GPL (General Public License), and is available at http://www.claraocr.org/. If you don't know what the GPL is, please read it and check the GPL FAQ at http://www.gnu.org/copyleft/gpl-faq.html. You should have received a copy of the GNU General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free Software Foundation can be found at http://www.fsf.org.

3. CREDITS

Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, propaganda and many ideas on how to make the software more useful.

Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.

Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.

Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.

The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.

The names cited by the CHANGELOG and not cited before follow (small patches, bug reports, specfiles, suggestions, explanations, etc).

Brian G., Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (packager), Tim McNerney, Tyler Akins.