OCR Deserves Recognition

OCRed text is editable; the text in a scanned image is not. That much is certain. But real text files have further advantages over bitmaps: they are much more compact than images, even with efficient image compression, and so take up less disk space and less bandwidth when they travel over a network. (In professional, large-volume contexts, this means that electronic archives can hold larger collections of documents: CD-ROMs or DVDs that contain more documents, smaller investments in RAID systems or jukeboxes, and so on.)

[Figures: a Redundant Array of Independent Disks (RAID) system; optical jukeboxes]

And as your documents are dematerialized, a lot of physical space is freed up: a hard disk containing terabytes of data takes up hardly any room, whereas file cabinets full of paper documents do. (Not to mention the safety issue: backing up a file collection is not a difficult task. But have you ever heard of a company that photocopies file cabinets full of paper, just to be on the safe side?)

And last but not least: text files are searchable. You can, for instance, create a “full-text” database of a document collection. Any word your texts contain can then be used as a search criterion; you are no longer limited to a fixed set of keywords to retrieve data.
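To make the idea of a “full-text” database concrete, here is a minimal sketch in Python of an inverted index: a table that maps every word to the documents containing it. The file names and sample texts are invented for illustration, and the helper names (tokenize, build_index, search) are hypothetical; a real archive system would add ranking, word stemming and storage on disk.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map every word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the names of the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical OCRed documents, invented for this example.
documents = {
    "invoice.txt": "Invoice for the delivery of two flatbed scanners",
    "memo.txt": "Memo about chaos theory and weather forecasts",
    "report.txt": "Annual report: the theory behind character recognition",
}

index = build_index(documents)
print(sorted(search(index, "theory")))        # ['memo.txt', 'report.txt']
print(sorted(search(index, "chaos theory")))  # ['memo.txt']
```

Because every word of every document ends up in the index, no one has to decide in advance which keywords matter.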

When a “full-text” database has to be created from a collection of paper documents, OCR is simply the only solution. “Full-text” searching has become very popular in recent years thanks to the Internet search engines. “Google” the search term “chaos theory” and the search engine gives you a list of the web pages it has indexed that mention the search term.

[Figure: search executed with a search engine]

Of course, the Internet search engines only index HTML files that can be found on the world-wide web. But imagine for a second that you had to convert paper documents into text files to obtain these HTML pages in the first place... How useful would character recognition be to you!

The recognized documents are never proofread in such OCR applications: “full-text” searching in large-volume archives can cope with some “noise” and “silence”. (Documentary “noise” is when your searches return irrelevant, unwanted documents; when that happens on the Internet, you speak of “spam”! “Silence” is the flip side of the coin: you’re missing some documents that contain good information!) Important words are bound to occur more than once in a multipage document, and your search engine can execute “fuzzy searches” to overcome typos.
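A “fuzzy search” can be sketched with nothing more than Python's standard difflib module, which scores how similar two strings are. The vocabulary below is invented: it imitates an unproofread OCR result in which the letter “o” was twice read as the digit “0”. The function name fuzzy_lookup and the cutoff of 0.8 are assumptions chosen for this example, not settings taken from any particular search engine.

```python
import difflib

def fuzzy_lookup(query, vocabulary, cutoff=0.8):
    """Return the indexed words that resemble the query closely enough.

    The cutoff is a similarity score between 0 and 1; the value 0.8 is
    an arbitrary choice for this sketch.
    """
    return difflib.get_close_matches(query.lower(), vocabulary, n=5, cutoff=cutoff)

# Hypothetical words found in OCRed text, two of them with a recognition error.
vocabulary = ["rec0gnition", "scanner", "archive", "chaos", "the0ry"]

print(fuzzy_lookup("recognition", vocabulary))
print(fuzzy_lookup("theory", vocabulary))
```

An exact search for “recognition” would stay silent here, whereas the fuzzy lookup still reaches the misread word. Lowering the cutoff reduces “silence” further but lets in more “noise”.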

Not that it takes such highly specialized solutions to get this functionality: nowadays even operating systems (“OSes”) offer search features that look for words contained inside text documents...

[Figures: search executed with Windows 7; search executed with Spotlight, the Mac OS search feature]
