What Gets Read And What Doesn’t

What are page decomposition and page analysis?

So much for binarization. We now have a black-and-white image where the text becomes readable because the color background (if any) was removed.

But that doesn’t mean that the entire image has to be submitted to the OCR process. A scan may contain, say, a big title, several text columns, photos, a table and a footer. In any case, indicating several text windows is mandatory whenever a text is arranged in columns.

Full page with complex, columnized layout

How do you indicate the zones of interest to be recognized on the scanned page? There are varying methods of zoning images: automatic page analysis, manual windowing and the use of fixed layouts.

Page analysis does the job for you: the text windows, the graphics (photos, logos, artwork etc.) and tables are detected and sorted automatically.

With manual windowing, the user indicates which zones he wants to capture. You can draw rectangular and polygonal windows on a displayed image to indicate which text and table zones you want to recognize and which graphics you want to save.

You can also take the middle road: have the system detect the various blocks, then intervene to edit the zoning results by excluding certain blocks or columns from recognition, changing the sort order etc.

Finally, OCR software allows you to store specific zoning templates for future use. Such predefined window structures are particularly useful when documents with a similar page layout are recognized. When you scan the pages of a book, a zoning template comes in handy: you frame the actual text, excluding the headers and footers that indicate the chapter, page number, publication, possibly even the black borders generated by your scanner etc.

Back to top


What Are Page Decomposition and Page Analysis?

Page decomposition is the process whereby advanced OCR software breaks a page up in text blocks, graphics and tables — the zone types may vary per OCR package. That can be done manually and automatically.

In the latter case, we speak of “page analysis”. Page analysis implies that the various blocks as occur on a scanned page are detected and sorted by the software, not by the user. Ideally speaking, intelligent routines detect which zones contain text, which zones contain graphics and which contain tables.

Page analysis is particularly useful when columnized texts and documents with a complex page layout are recognized, because the user avoids all the more work when it comes to indicating the zones of interest.

Page analysis of a full-page document with a complex layout

(Or combine the best of two world: have the system detect the text blocks, then intervene to exclude certain blocks or columns from recognition, change the sort order, create new zones where the automatic page analysis did not yield optimal results etc.)

Page analysis is sophisticated to the point of detecting irregular shapes: it will for instance adequately frame a graphic embedded in text columns.

Page analysis of irregular shapes

Seems far fetched? Not really. Have a look at this image for instance: the OCR can only be successful when it can “work around” the frivolous graphic with the umbrella man in the middle.

Page analysis of page with complex layout

The supported zone types can vary according to the OCR software you’re using. But decent OCR software should at least support three zone types: text zones, graphic zones and table zones.

Graphic zones aren’t recognized but can be included in the recognized documents — just as you have illustrations in the source documents. Tables are recognized — but they get processed differently than text zones. Here too, the objective is to recreate the source document, which implies including the tables.

Advanced OCR software detects both “gridded” and “ungridded” tables. “Gridded” or “framed” tables have borders around the cells, “ungridded” tables don’t have any borders around the cells — they’re simply composed of text organized in columns.

Gridded table            Ungridded table

Most OCR software can limit recognition to a numeric mode. Thus, you exclude possible confusion between “O” and ‘0’, between “B” and ‘8’ etc. The numeric mode does include currency symbols such as, say, the dollar ($), pound (£), yen (¥) and the Euro (€) symbol. Things may vary somewhat per OCR package.

Control panel for the numeric recognition mode

State-of-the-art OCR software will even detect zones in “inverted” frames. Let’s say that half of the scanned page is a black or dark frame which contains a few text blocks (and possibly a table). Not to worry: the OCR software will look for the text (and table) zones inside the inverted frame!

Scanned page with an inverted frame          Page analysis inside an inverted frame

Actually, the user doesn ’t have to do anything: the OCR software automatically inverts these zones when they have to be recognized.

There may even be a special page decomposition algorithm for Asian documents: the interline spacing of Asian documents is in most cases bigger than in Western documents and the text is less “dense”. The words are made up of small icons, so-called “ideograms”, that could easily be seen as graphic zones in Western documents. Also, the text may run from top to bottom and from right to left, and the OCR software has to sort the text blocks accordingly.

Page analysis of a Chinese document

Hebrew and Arabic documents equally require special page analysis routines: these languages are written from right to left!

Page analysis of a Hebrew document

However, numbers are written from left to right, and so are embedded phrases in Latin script. In other words: when numbers and Latin words are inserted in Hebrew and Arabic texts, both the reading direction and the alphabet change in mid course!

Latin words embedded in Arabic document

Latin words embedded in Hebrew document

Back to top

Submit feedback

Pin it          Tweet                    

Previous pageNext page

Let’s take things step by step, shall we?Take us where the rainbow ends!B is for binarizeWhat gets read and what doesn’tLines, lineskew and drop lettersSegmenting words and charactersStylized fontsWhy is OCR software called omnifont?What’s the role of linguistics in the OCR process?

Home pageIntroScannersImagesHistoryOCRLanguagesAccuracyOutputBCRPen scannersSitemapSearchFeedback – Contact