Let’s Take Things Step by Step, Shall We?

A scanner generates an image of the paper document and the text is intelligently “extracted” from that image. But what really happens? Can we be more specific about the recognition process?

The document is read by your scanner. This device acts as the “eye” of your computer and sends it the image.
At this stage, the document image is only a meaningless cloud of dark points, “pixels”, on a lighter background.

Intelligent binarization routines convert color and greyscale images into black-and-white images.

Black-and-white image after the binarization
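To make the binarization step concrete, here is a minimal sketch in Python. The text does not say which routine an OCR package actually uses; this example assumes Otsu's method, a well-known global thresholding technique, as a stand-in. The function names `otsu_threshold` and `binarize`, and the convention that 1 means ink, are illustrative choices and not part of any particular product.

```python
def otsu_threshold(gray):
    """Pick the grey level that best splits pixels into dark and light.

    `gray` is a list of rows, each a list of ints in 0..255.
    Assumption: Otsu's method stands in for the "intelligent" routine.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_dark = 0     # pixels at or below the candidate threshold
    sum_dark = 0   # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor).
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray, threshold):
    """Return a bitmap where 1 marks ink (dark) and 0 marks background."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


scan = [[250, 240, 30],
        [20, 245, 235]]
bitmap = binarize(scan, otsu_threshold(scan))
```

A single global threshold like this tends to struggle with uneven lighting or colored backgrounds, which is one reason real packages rely on more elaborate routines.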

Page analysis comes next: the zones of interest to be recognized are marked on the scanned page.
A page may contain a big title, several text columns, two photos, a table and a footer.

Text and graphic zones on a scanned image

The OCR software extracts text information from the black-and-white pixels of the selected zones: it recognizes the shapes and assigns characters. This is done in several steps.

Line segmentation consists of slicing a page of text into its individual lines. This step also analyzes interline spacing, line skew and drop letters, and separates touching lines.
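As an illustration of the basic idea only, the sketch below slices a binarized page into lines by looking for rows that contain no ink at all (a horizontal projection). It assumes a bitmap given as a list of rows where 1 means ink, and it deliberately ignores skew, drop letters and touching lines, which a real OCR engine has to handle. The name `segment_lines` is hypothetical.

```python
def segment_lines(bitmap):
    """Return (first_row, last_row) pairs, one per run of rows containing ink."""
    lines = []
    start = None  # first row of the text line currently being scanned
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        # The last text line runs to the bottom edge of the page.
        lines.append((start, len(bitmap) - 1))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
line_ranges = segment_lines(page)
```

On a skewed scan the blank rows between lines disappear, so this simple approach only works once the page has been straightened.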

Word segmentation isolates each word from its neighbors.
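One simple way to picture word segmentation is to scan a single text line column by column and treat wide blank gaps as word boundaries, while narrow gaps are taken to be spacing between letters. The sketch below assumes a line bitmap (list of rows, 1 means ink); the function name `segment_words` and the `min_gap` parameter are illustrative assumptions, and a fixed gap width is a simplification of what real software does.

```python
def segment_words(line, min_gap=3):
    """Return (first_col, last_col) pairs, one per word in a text-line bitmap.

    Runs of at least `min_gap` blank columns count as word gaps;
    narrower runs are treated as spacing between letters.
    """
    width = len(line[0]) if line else 0
    inked = [any(row[x] for row in line) for x in range(width)]

    words = []
    start = None     # first column of the word being built
    last_ink = None  # most recent column that contained ink
    for x, ink in enumerate(inked):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The blank run just crossed is wide enough to end the word.
            words.append((start, last_ink))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink))
    return words


text_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
word_ranges = segment_words(text_line, min_gap=3)
```

In practice the gap threshold would have to adapt to the font size and to justified text, where the spaces between words vary from line to line.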

Character segmentation separates the individual letters of a word; it does not apply when word image decoding is used.
This step organizes the dots of a scanned image into characters. If the characters all have the same width (“fixed” pitch), character segmentation is easy. The problem gets more interesting when the width of the letters depends on their shape (“proportional” pitch), when kerning and touching characters (“ligatures”) occur, and when dot matrix fonts (characters composed of several clouds of isolated dots) are used.
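The fixed-pitch case is easy enough to sketch in a few lines: once the pitch (the common character width in pixels) is known, a word can simply be cut into equal cells. The function name `segment_fixed_pitch` and its parameters are hypothetical, and the sketch assumes the pitch has already been measured; it says nothing about the harder proportional, touching or dot matrix cases.

```python
def segment_fixed_pitch(first_col, last_col, pitch):
    """Cut the column range of a word into cells `pitch` pixels wide.

    Returns (first_col, last_col) pairs, one per character cell.
    The final cell may be narrower if the word width is not a
    multiple of the pitch.
    """
    if pitch <= 0:
        raise ValueError("pitch must be a positive number of pixels")
    cells = []
    x = first_col
    while x <= last_col:
        cells.append((x, min(x + pitch - 1, last_col)))
        x += pitch
    return cells


cells = segment_fixed_pitch(0, 29, 10)
```

With proportional fonts no such constant exists, which is why the blank columns between letters, or the shapes themselves, have to guide the cuts instead.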

The actual character recognition extracts distinguishing features from each isolated shape and assigns a symbol to it.
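To give a feel for "extracting characteristics and assigning a symbol", here is a toy sketch. It assumes one particular kind of feature (the ink density in each quarter of the glyph, often called zoning) and one particular decision rule (pick the stored template whose features are closest). Both are illustrative assumptions; the text does not state which features or classifier a given OCR package uses, and the names `zone_features` and `classify` are hypothetical.

```python
def zone_features(glyph, zones=2):
    """Split a glyph bitmap into zones x zones cells; return ink density per cell."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * height // zones, (zy + 1) * height // zones
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features


def classify(glyph, templates):
    """Return the symbol whose template features are nearest to the glyph's."""
    features = zone_features(glyph)

    def distance(symbol):
        reference = zone_features(templates[symbol])
        return sum((a - b) ** 2 for a, b in zip(features, reference))

    return min(templates, key=distance)


# Two tiny 4x4 reference shapes (assumed for illustration only).
templates = {
    "I": [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]],
    "L": [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]],
}
# An "L" with one pixel missing from its foot.
noisy = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
symbol = classify(noisy, templates)
```

Because the decision is made on features and not on an exact pixel match, the damaged shape is still assigned to the nearest template.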

The stage is set. Let’s now discuss the successive steps of the OCR process in detail!
