OCR Deserves Recognition

OCRed text is editable; the text in scanned images is not. That much is certain. But real text files have further advantages over bitmaps: they are much more compact than images, even with efficient image compression, and so take up less disk space and less bandwidth when traveling over a network. (In professional, large-volume contexts, this means that electronic archives can cover larger collections of documents, which implies CD-ROMs or DVDs that hold more documents, smaller investments in RAID systems or jukeboxes, and so on.)
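To get a feel for the size gap, here is a rough back-of-the-envelope sketch in Python. The page dimensions (A4), the 300 dpi black-and-white scan and the figure of 3,000 characters per page are illustrative assumptions, not measurements; the function names are hypothetical. The sketch only computes the uncompressed bitmap size, so image compression would narrow the gap without closing it.

```python
import math

def raw_bitmap_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page (rows padded to whole bytes)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px

def text_bytes(text):
    """Size in bytes of the same page stored as UTF-8 text."""
    return len(text.encode("utf-8"))

# Assumed values: an A4 page (8.27 x 11.69 inches) scanned at 300 dpi,
# 1 bit per pixel, holding roughly 3,000 characters of text.
page_text = "x" * 3000
image_size = raw_bitmap_bytes(8.27, 11.69, 300, 1)
text_size = text_bytes(page_text)

print(f"bitmap: {image_size} bytes, text: {text_size} bytes")
print(f"the bitmap is about {image_size / text_size:.0f} times larger")
```

Changing the assumed resolution or color depth (for instance 8 bits per pixel for grayscale) makes the difference larger still.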

And as your documents are dematerialized, much physical space is gained: a hard disk containing terabytes of data hardly takes up any space, whereas file cabinets for paper documents do. (Not to mention the safety issues: creating a backup of a file collection is not difficult. But have you ever heard of a company that photocopies file cabinets full of paper, just to be on the safe side?)

And last but not least: text files are searchable. You can, for instance, create a “full-text” database of a document collection. Any word your texts contain can then be used as a search criterion; you are no longer limited to a fixed set of keywords to retrieve data.
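The idea of a “full-text” database can be sketched with a small inverted index, which maps every word to the documents that contain it. This is only a minimal illustration: the function names and the sample documents are hypothetical, and real full-text engines do considerably more (ranking, stemming, phrase queries).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical OCRed documents, keyed by an arbitrary id.
documents = {
    "a": "Chaos theory studies dynamical systems.",
    "b": "An invoice for scanning services.",
    "c": "A theory of chaos and order.",
}
index = build_index(documents)
print(sorted(search(index, "chaos theory")))
```

Every word in the collection becomes a search criterion without anyone having to assign keywords by hand.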

When a “full-text” database needs to be created from a collection of paper documents, OCR is simply the only solution. “Full-text” searching has become very popular in recent years thanks to the Internet search engines. “Google” the search term “chaos theory” and the search engine gives you a list of web pages that mention the search term.

Of course, the Internet search engines only index HTML files that can be found on the world-wide web. But imagine for a second that you had to convert paper documents into text files to obtain these HTML pages in the first place... How useful would character recognition be to you!

The recognized documents are never proofread in such OCR applications: “full-text” searching in large-volume archives can cope with some “noise” and “silence”. (Documentary “noise” is when your searches return irrelevant, unwanted documents; when that happens on the Internet, you speak of “spam”! “Silence” is the flip side of the coin: you miss some documents that contain good information!) After all, important words are bound to occur more than once in a multipage document, and your search engine can execute “fuzzy searches” to overcome typos.
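A “fuzzy search” of this kind can be approximated with Python's standard `difflib` module, which finds words that are similar to the query rather than identical to it. The sketch below is one possible approach, not how any particular search engine works; the function name `fuzzy_lookup`, the sample vocabulary and the similarity cutoff of 0.8 are assumptions chosen for illustration.

```python
from difflib import get_close_matches

def fuzzy_lookup(query, vocabulary, cutoff=0.8):
    """Return vocabulary words whose similarity to the query is at least cutoff."""
    return get_close_matches(query.lower(), vocabulary, n=5, cutoff=cutoff)

# Hypothetical words taken from an unproofread OCR result:
# "recognltion" contains a typical recognition error (l instead of i).
vocabulary = ["recognltion", "scanner", "bitmap"]

print(fuzzy_lookup("recognition", vocabulary))
```

A lower cutoff tolerates more recognition errors but also returns more “noise”, so the value is a trade-off between the two.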

Not that it takes highly specialized solutions to use such functionality: even operating systems (“OSes”) nowadays offer search features that look for words contained inside text documents.
