The Arabic Alphabet

Now we’ll discuss the Arabic script in detail, because it poses many specific challenges for OCR software. (We have already discussed some of these elements.)

The Arabic alphabet has a limited set of 28 letters. (Some additional letters are used to represent sounds that do not occur in Arabic, such as “p” or “g”, but do occur in foreign words.)

Arabic document

Historically speaking, the Arabic script is an imitation of handwriting. Most letters are joined to the following letter of the same word. Some combinations of letters form special “ligatures”. Only a few letters (“disjoined letters”), for instance alif (ا) and the phonetic symbol hamza (ء), are not connected to the next letter in the word. (You can play around with the online Arabic keyboard if you want to give it a try!)

All the letters are lowercase: there are no uppercase letters in Arabic!

Short vowels are (normally) not written — the Quran and language-learning books for children are an exception — but long vowels are.

The shape of a letter depends on its position in the word. Each letter has up to 4 shapes: one shape is used at the beginning of the word, a second shape is used when the character occurs in the middle of the word, and a third shape is used in final position. The fourth shape is used to write the isolated character.

Arabic characters with varying shapes depending on the position
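
For readers who process OCR output in software: Unicode encodes these positional shapes as separate “presentation form” code points, next to the ordinary letter. The Python sketch below looks them up by name with the standard unicodedata module. The function name positional_forms is our own and purely illustrative, and the sketch assumes that the naming pattern “<letter name> <position> FORM” holds for the letter you pass in.

```python
import unicodedata

POSITIONS = ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")

def positional_forms(letter):
    """Return the presentation forms Unicode defines for one Arabic letter."""
    base_name = unicodedata.name(letter)  # e.g. "ARABIC LETTER BEH"
    forms = {}
    for position in POSITIONS:
        try:
            forms[position] = unicodedata.lookup(f"{base_name} {position} FORM")
        except KeyError:
            # No such code point: the letter has no shape for this position.
            pass
    return forms

beh = positional_forms("\u0628")   # a letter that joins on both sides
alef = positional_forms("\u0627")  # alif, a "disjoined" letter
print(sorted(beh))
print(sorted(alef))
```

A letter such as beh yields all four forms, while alif yields only an isolated and a final form, because it is never joined to the letter that follows it.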

Some characters can vary significantly, depending on their position inside a word. The “ha” letter has two alternatives for the medial position: that symbol can be written in 5 different ways, no less!

Arabic ha symbol with 5 shapes

Some letter shapes include dots; the position of the dot changes the meaning of the letter! These symbols are called “toothed” characters because the basic form — the red shape without the dots — looks like a tooth!

Toothed characters of the Arabic alphabet

Doubled letters are written only once; the “shadda” diacritic can be added above the letter to mark it as doubled.

Arabic diacritic Shadda
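
In digital text the shadda is a separate combining character (U+0651) that follows the letter it modifies. The following Python sketch shows two ways a program might deal with it after recognition: writing the doubled letter out twice, or dropping all diacritics so that the result can be compared with unvocalized text. Both function names are hypothetical, and treating every nonspacing mark as a diacritic is a simplifying assumption.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA

def expand_shadda(text):
    """Write out each letter that carries a shadda twice."""
    out = []
    last_base = None
    for ch in text:
        if ch == SHADDA:
            if last_base is not None:
                out.append(last_base)  # repeat the letter instead of the mark
        else:
            out.append(ch)
            if unicodedata.combining(ch) == 0:
                last_base = ch         # remember the most recent base letter
    return "".join(out)

def strip_marks(text):
    """Remove all nonspacing marks (short vowels, shadda, and so on)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# meem, damma, hah, fatha, meem, shadda, fatha, dal
word = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(expand_shadda(word))
print(strip_marks(word))
```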

The numerals used with the Latin alphabet (ironically called “Arabic numerals”) are used alongside the numerals of the Arabic script, which are called “Indian numerals”!

Arabic numerals

(Farsi (a.k.a. Persian), another language that uses the Arabic script, uses different symbols to represent the Indian numbers 4, 5 and 6.)

Arabic and Farsi numerals           Farsi document with numerals
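
Software that receives recognized text usually wants one digit set to calculate with. Unicode gives the Indian numerals used in Arabic the code points U+0660 to U+0669 and the Farsi variants the code points U+06F0 to U+06F9. The Python sketch below maps either set to the Western digits 0 to 9 with the standard unicodedata module; the function name to_western_digits is an illustrative choice of ours.

```python
import unicodedata

def to_western_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)

arabic_362 = "\u0663\u0666\u0662"  # Arabic-Indic digits three, six, two
farsi_456 = "\u06F4\u06F5\u06F6"   # Extended Arabic-Indic digits four, five, six
print(to_western_digits(arabic_362))
print(to_western_digits(farsi_456))
```

Note that the Arabic and the Farsi digit four are different characters for a computer, even though both mean 4, so comparing raw recognition results without such a conversion can fail.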

Arabic uses two widely different font types. “Naskh” is the round, calligraphic script you find in books, newspapers etc. and on computers. “Kofi” (also called “Koufi” or “Kufic”) is easy to recognize: this script looks very “square”, not rounded!

Arabic Naskh script               Arabic Kofi, Koufi or Kufic script

Arabic Kofi, Koufi or Kufic typefaces

Arabic text is written on (base) lines, as is the Latin alphabet. Shallow letters rest on the line of writing. Tall letters do the exact same thing but are tall like a European “l”. Deep shapes start above the line of writing, swoop below it and then swoop up again.

Baseline of Arabic characters

To justify printed text placed in columns etc., the shape of some letters can be elongated (this elongation is called “Kashida” or “tatweel”). It goes without saying that things are done very differently in the Latin script.

Kashida - tatweel in Arabic text

Kashida - tatweel in Arabic text        Justification of Latin-based text
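
In electronic text, an elongation is often stored as a dedicated character, the tatweel (U+0640), inserted between two joined letters. Because it only stretches the word and carries no meaning, a program may want to remove it before searching or comparing recognized text. The Python sketch below does just that; the function name remove_tatweel is hypothetical, and the sketch assumes the elongation was encoded as a character rather than produced by stretching the glyphs in the font.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL

def remove_tatweel(text):
    """Drop elongation characters so that stretched and plain words match."""
    return text.replace(TATWEEL, "")

plain = "\u0643\u062A\u0627\u0628"                  # kaf, teh, alef, beh
stretched = "\u0643\u0640\u0640\u062A\u0627\u0628"  # same word with two tatweels
print(remove_tatweel(stretched) == plain)
```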

Arabic is written from right to left, just like Hebrew. However, both the Indian and Arabic numerals are written from left to right, and so are embedded phrases in Latin script!

Arabic text with numerals   (… $362.250)               Arabic text with Latin numerals   (… 1559 …)

Arabic text with English word
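
Computers store such mixed text in reading order and decide the display direction per character. Unicode assigns each character a “bidirectional class” for this purpose. The Python sketch below groups a string into runs of right-to-left, left-to-right and neutral characters using the standard unicodedata module. It is a deliberately coarse illustration, not the full Unicode bidirectional algorithm: the names coarse_direction and direction_runs are our own, and punctuation, spaces, currency signs and diacritics are all lumped together as “NEUTRAL”.

```python
import unicodedata
from itertools import groupby

def coarse_direction(ch):
    """Map a character's Unicode bidirectional class to a coarse direction."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):        # Hebrew and Arabic letters
        return "RTL"
    if bidi in ("L", "EN", "AN"):  # Latin letters, European and Arabic-Indic digits
        return "LTR"
    return "NEUTRAL"               # spaces, punctuation, marks, ...

def direction_runs(text):
    """Split text into consecutive runs that share a coarse direction."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=coarse_direction)]

sample = "\u0639\u0627\u0645 1559"  # an Arabic word followed by a Latin number
for direction, run in direction_runs(sample):
    print(direction, repr(run))
```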

As complicated as OCR of the Arabic language may be, there’s certainly reason to develop such software. Arabic speakers are estimated at 280 million. Arabic is the official language of 27 countries, no less. (Spoken Arabic varies by region, but written Arabic, “Modern Standard Arabic”, is the same throughout the Arab world.)

(If you want a full list, here it is (in alphabetical order): Algeria, Bahrain, Chad, Comoros, Djibouti, Egypt, Eritrea, the Gaza Strip (governed by the Palestinian Authority), Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates (U.A.E.) (includes Dubai) and Yemen. And until July 2018, Arabic was an official language in Israel, where 1 million Palestinians live!)

And then there’s Farsi. This Indo-European language written in Arabic script is used in Iran. (Farsi is often called “Persian”.)

Farsi document

Farsi uses 32 characters (unlike Arabic, which has a symbol set composed of 28 characters). Four symbols are unique to Farsi.

Farsi characters             Farsi characters
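
The presence of these extra letters can serve as a rough hint that a recognized text is Farsi rather than Arabic. The text above does not name the four letters; the Python sketch below assumes they are the ones Unicode calls peh (U+067E), tcheh (U+0686), jeh (U+0698) and gaf (U+06AF). The function name persian_letters_in is hypothetical. Treat the result as a hint only, since letters for “p” or “g” can also turn up in foreign words inside Arabic text.

```python
# Assumed set of the four letters that Farsi adds to the Arabic alphabet.
PERSIAN_ONLY = {
    "\u067E": "peh",
    "\u0686": "tcheh",
    "\u0698": "jeh",
    "\u06AF": "gaf",
}

def persian_letters_in(text):
    """Return the names of the Farsi-specific letters found in the text."""
    return sorted({PERSIAN_ONLY[ch] for ch in text if ch in PERSIAN_ONLY})

farsi_word = "\u067E\u0627\u0631\u0633\u06CC"  # begins with the letter peh
arabic_word = "\u0643\u062A\u0627\u0628"       # contains none of the four
print(persian_letters_in(farsi_word))
print(persian_letters_in(arabic_word))
```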

This we already know: Farsi uses different symbols to represent the Indian numbers 4, 5 and 6.

Arabic and Farsi numerals           Farsi document with numerals

Farsi does not use elongated letters (“tatweel” or “Kashida”); instead, extra spaces are added between the words to align the text in the columns of newspapers, magazines etc.
