What is the principle of displaying different languages (Arabic and English) together?

Question

Such as this sentence:

عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC

Arabic is RTL and English is LTR. Sometimes the text gets out of order after copying and pasting. When I move the cursor through the sentence between English and Arabic characters, it jumps in a very strange way. I am also confused about how this is stored in memory. Can anyone explain this?

Joey · Accepted Answer

In memory this is all stored as a sequence of Unicode code points (hopefully; there were very weird things before that, but let's not go there). That is the text itself, as it is represented in the computer. At this stage the text is independent of writing direction; it's just a sequence of characters in logical order.
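To make the storage point concrete, here is a minimal Python sketch that lists the code points of a short mixed string in the order they are stored. The sample string (one Arabic word from the question followed by "STC") and the helper name `describe` are my own choices for illustration.

```python
import unicodedata

# Mixed sample in logical (typing/reading) order: the Arabic word "عميل", a space, "STC".
# Escapes are used so the source code itself is not affected by bidi display.
text = "\u0639\u0645\u064a\u0644 STC"

def describe(s):
    """Return (index, code point, Unicode name) for each character as stored."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for i, ch in enumerate(s)]

for row in describe(text):
    print(row)

# The Arabic letters come first in memory even though they are drawn on the right.
# Encoded as UTF-8, each Arabic letter here takes 2 bytes and each ASCII character 1.
print(len(text), "code points,", len(text.encode("utf-8")), "UTF-8 bytes")
```

Nothing in this data says "right-to-left"; the direction only comes into play when the string is laid out for display.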

This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and can therefore shape the text into glyphs and display them at the right positions. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that a is an LTR character while א is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still ( in the text, even though you see )); and it marks several characters as neutral, so they can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; the result is a grapheme, which is essentially what we perceive as a “letter”.
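The per-character Bidi property can be inspected directly. The sketch below uses Python's standard `unicodedata` module; the helper name `bidi_info` and the choice of sample characters are mine. It only looks up properties and does not run the Bidi algorithm itself.

```python
import unicodedata

def bidi_info(ch):
    """Return (bidi class, mirrored flag) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0639": "Arabic letter ain",
    "(": "opening parenthesis",
    " ": "space",
    "1": "European digit",
}

for ch, label in samples.items():
    bidi_class, mirrored = bidi_info(ch)
    print(f"U+{ord(ch):04X} {label}: class={bidi_class} mirrored={mirrored}")
```

Here `L` is left-to-right, `R` and `AL` are right-to-left (Hebrew and Arabic letters respectively), while `ON` and `WS` are neutral classes whose direction is decided by the surrounding text. The mirrored flag on the parenthesis is what lets a renderer draw it flipped inside an RTL run.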

Cursor movement is then easy, because the cursor can only be between two graphemes (it gets more complicated at the start of an LTR or RTL segment, but let's leave it at that for now): → moves it forwards through them while ← moves backwards. In RTL, forwards means left, of course; it follows the text direction. The order the two graphemes have relative to each other doesn't really matter for positioning the cursor.
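As a rough sketch of the "cursor sits between graphemes" idea, the code below computes the logical positions where a caret may stop. It is a deliberate simplification: it treats a base character plus any following combining marks as one unit, whereas real engines follow the full grapheme cluster rules of Unicode Standard Annex #29. The function names `cursor_stops` and `move` are hypothetical.

```python
import unicodedata

def cursor_stops(s):
    """Logical indices where a caret may sit: never directly before a combining mark."""
    stops = [i for i, ch in enumerate(s)
             if i == 0 or not unicodedata.category(ch).startswith("M")]
    stops.append(len(s))
    return stops

def move(s, pos, forward=True):
    """Move the caret one unit forwards or backwards in logical order."""
    stops = cursor_stops(s)
    if forward:
        later = [p for p in stops if p > pos]
        return later[0] if later else pos
    earlier = [p for p in stops if p < pos]
    return earlier[-1] if earlier else pos

# First word of the question's sentence: four letters, the last carrying a diacritic (U+064B).
word = "\u0639\u0641\u0648\u0627\u064b"
print(cursor_stops(word))          # five code points, but no stop at index 4
print(move(word, 3))               # skips over the letter and its diacritic together
print(move(word, 5, forward=False))
```

The movement is defined entirely on logical positions. Whether "forwards" ends up being drawn as left or right is decided later, when the renderer maps each logical position to a place on screen.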

I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.

Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common cause is missing application or layout engine support for the respective script. If the layout engine does not know how to lay out Arabic text, all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction is applied. For example, browsers have quite good support for this kind of thing by now, but if I take the Arabic text and paste it into Word it will look wrong (this was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same; it's just the display that's wrong.

Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial as far as layout is concerned. This is a recollection of how I think it works and might not be actual fact.
