The Hebrew alphabet has a limited character set of 27 symbols: 22 letters, plus a separate final form (“sofit”) for five of them, used when the letter appears at the end of a word.
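To make the count concrete, here is a minimal Python sketch built on the Unicode Hebrew letters, which run from U+05D0 (alef) to U+05EA (tav). The names `FINAL_TO_REGULAR`, `HEBREW_SYMBOLS` and `fold_final_forms` are made up for this illustration; they do not come from any OCR engine.

```python
# Each final ("sofit") form mapped to its regular counterpart.
FINAL_TO_REGULAR = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}

# The 27 symbols: every code point from alef (U+05D0) to tav (U+05EA).
HEBREW_SYMBOLS = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]


def fold_final_forms(word):
    """Replace final forms by regular forms, e.g. before a dictionary lookup."""
    return "".join(FINAL_TO_REGULAR.get(ch, ch) for ch in word)


print(len(HEBREW_SYMBOLS))                               # all symbols
print(len(set(HEBREW_SYMBOLS) - set(FINAL_TO_REGULAR)))  # letters without the final forms
```

Folding final forms like this is one possible design choice for comparing words; a recognizer still has to tell the two shapes apart on the page.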

Hebrew makes no distinction between uppercase and lowercase. In effect, all the letters are uppercase: THERE ARE NO LOWERCASE LETTERS IN HEBREW!

The symbol set is composed of consonants: the vowels are not usually written (except in the Bible, poetry and books for children and foreign learners). THR R N LWRCS LTTRS N HBRW!
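When the vowels are written, as in the Bible or a children's book, they appear as small points around the consonants, and Unicode encodes them as combining marks. The sketch below strips those marks to recover the consonant-only spelling. It assumes that every nonspacing mark in the range U+0591 to U+05C7 can be dropped, which removes cantillation marks as well as vowel points; `strip_points` is a hypothetical helper name.

```python
import unicodedata


def strip_points(text):
    """Remove Hebrew vowel points and other combining marks, keeping the letters."""
    # NFD splits precomposed presentation forms into a letter plus its marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not (unicodedata.category(ch) == "Mn" and "\u0591" <= ch <= "\u05C7")
    )


# "shalom" with vowel points: shin + shin dot + qamats, lamed, vav + holam, final mem
pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(strip_points(pointed))
```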

Hebrew is written from right to left, just like Arabic. (Let’s face it: Hebrew is not the easiest language to read...) !WRBH N SRTTL SCRWL N R RHT
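Unicode stores text in logical order, meaning the character that is read first comes first, whatever side of the page it sits on. A recognizer that collects characters from the left edge of the image to the right therefore ends up with a purely right-to-left line backwards. A minimal sketch of the fix, assuming the line contains no numbers or Latin words (`visual_to_logical` is a hypothetical name):

```python
def visual_to_logical(line):
    """Reverse a purely right-to-left line that was collected left to right."""
    return line[::-1]


as_scanned = "!WRBH N SRTTL SCRWL N R RHT"
print(visual_to_logical(as_scanned))
```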

Unlike Arabic, however, Hebrew does not have any separate numerals. The standard Western numerals (1, 2, 3 etc.) are used instead. Furthermore, these numbers are written from left to right, and so are embedded phrases in Latin script. In other words: when numbers and Latin words are inserted in Hebrew text, the reading direction changes in midstream, and with Latin words the alphabet changes too!
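The direction switches can be made visible with the bidirectional class that Unicode assigns to every character: `R` for Hebrew letters, `L` for Latin letters and `EN` for Western digits. The sketch below splits a mixed line into runs of one direction. It is a deliberate simplification and not the full Unicode Bidirectional Algorithm: it assumes a right-to-left base direction and simply attaches spaces and punctuation to the run before them. `directional_runs` is a hypothetical name.

```python
import unicodedata


def directional_runs(text):
    """Split text into (direction, run) pairs using Unicode bidi classes."""
    runs = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "R":
            direction = "rtl"
        elif bidi in ("L", "EN"):
            direction = "ltr"
        else:
            direction = None  # spaces, punctuation: no direction of their own
        if runs and (direction is None or runs[-1][0] == direction):
            runs[-1][1] += ch  # continue the current run
        else:
            runs.append([direction or "rtl", ch])  # start a new run
    # Drop the surrounding spaces and any run that held nothing else.
    return [(d, run.strip()) for d, run in runs if run.strip()]


# Hebrew for "version 2 of OCR", stored in logical (reading) order.
mixed = "\u05D2\u05E8\u05E1\u05D4 2 \u05E9\u05DC OCR"
for direction, run in directional_runs(mixed):
    print(direction, run)
```

On this sample the direction flips three times within five words, which is exactly the midstream change described above.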

Hebrew text with Latin words

And that’s not the only challenge when you develop an OCR engine for Hebrew. Here’s another peculiarity you don’t find in the other (Latin, Greek, Cyrillic and Arabic) alphabets: Hebrew text is not written on lines. Rather, the text hangs from a line above the letters! (In technical terms: the “base line” is above the characters, not under them!)
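One hypothetical way to look for that hanging line in a scanned text line is a horizontal projection profile: count the ink pixels in each pixel row and see where they pile up. The sketch below is only an illustration under stated assumptions, not the method of any particular OCR engine. The toy bitmap is a hand-drawn shape loosely resembling the letter dalet, with a bar on top and a stem on the right, and the function names are invented.

```python
def ink_per_row(bitmap):
    """Count the ink pixels ('#') in every pixel row of a binarized image."""
    return [row.count("#") for row in bitmap]


def densest_row(bitmap):
    """Return the index of the first row that holds the most ink."""
    profile = ink_per_row(bitmap)
    return profile.index(max(profile))


# Toy glyph: a horizontal bar on top, a vertical stem hanging from it.
glyph = [
    "######",
    "....#.",
    "....#.",
    "....#.",
]
print(ink_per_row(glyph))
print(densest_row(glyph))
```

For this toy shape the ink concentrates in the top row. Real scans are noisier, so treat the densest row as a first guess at most.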

Line of Hebrew text with the base line indicated

