Reputation: 2108
Such as this sentence:
عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC
Arabic is RTL and English is LTR. Sometimes after copy and paste the text gets out of order. When I move the cursor inside the sentence between English and Arabic characters, it jumps in a very strange way. I am also confused about how this is stored in memory. Can anyone help explain this?
Upvotes: 2
Views: 840
Reputation: 354386
In memory this is all stored as a sequence of Unicode code points (hopefully; there were very weird things before that, but let's not go there). That's the text itself, how it is represented in the computer. The text is independent of writing direction at first; it's just a sequence of characters.
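As a small illustration of that point, the following sketch prints the code points of a mixed-direction string in the order they are stored. The sample string is a made-up example (Latin "abc", a space, then the Hebrew letters alef, bet, gimel), written with escapes so the source itself is not reordered by your editor.

```python
# Hypothetical sample: Latin "abc", a space, then Hebrew alef, bet, gimel.
text = "abc \u05d0\u05d1\u05d2"

# Code points in storage (logical) order; direction plays no role here.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# One possible in-memory/on-disk encoding of the same sequence.
utf8_bytes = text.encode("utf-8")

print(code_points)
print(len(text), "code points,", len(utf8_bytes), "UTF-8 bytes")
```

The Hebrew letters come last in the list because they were typed last, regardless of where they end up on screen.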
This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and thus can shape the text into glyphs to display at a particular position. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that "a" is an LTR character while "א" is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still "(" in the text, even though you see ")"); and several characters can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; those are then graphemes, which are essentially what we perceive as a “letter”.
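The per-character Bidi property can be inspected with Python's standard unicodedata module. This is only a lookup of the property values, not the Bidi algorithm itself; the helper name bidi_info and the choice of sample characters are mine.

```python
import unicodedata

def bidi_info(ch):
    """Return (Bidi class, mirrored flag) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic ghain": "\u063a",
    "open paren": "(",
    "digit one": "1",
    "space": " ",
}

info = {label: bidi_info(ch) for label, ch in samples.items()}
for label, (bidi_class, mirrored) in info.items():
    print(f"{label:13} class={bidi_class:3} mirrored={mirrored}")
```

"L" is strong left-to-right, "R" and "AL" are strong right-to-left (Hebrew and Arabic letters respectively), while classes such as "ON" (other neutral), "EN" (European number) and "WS" (whitespace) take their direction from context. The mirrored flag is what lets a stored "(" be drawn as ")" inside an RTL run.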
Cursor movement is easy to do then, because the cursor can only be between two graphemes (it gets more complicated at the start of an LTR or RTL segment, but let's leave it at that for now), and → moves it forwards through them while ← moves it backwards. In RTL, forwards means left, of course; it follows the text direction. What order the two graphemes have relative to each other doesn't really matter in positioning the cursor.
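A rough sketch of "the cursor can only be between two graphemes": the function below lists the string indices where a cursor could rest, under the simplifying assumption that a grapheme is a base character plus any following combining marks. Real grapheme segmentation has more rules than this, and cursor_stops and the sample string are hypothetical names and data of mine.

```python
import unicodedata

def cursor_stops(text):
    """Indices where a cursor may rest: never directly before a combining mark.

    Simplification: a "grapheme" here is a base character followed by any
    characters whose general category starts with "M" (marks).
    """
    stops = [
        i for i, ch in enumerate(text)
        if i == 0 or not unicodedata.category(ch).startswith("M")
    ]
    stops.append(len(text))  # position after the last grapheme
    return stops

# "b", "e" + combining acute, space, Arabic beh + fatha, Arabic teh
sample = "be\u0301 \u0628\u064e\u062a"
stops = cursor_stops(sample)
print(stops)
```

Indices 2 and 5 are skipped because they sit between a base letter and its diacritic, so one press of an arrow key steps over both code points at once.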
I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.
Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common cause is missing application or layout engine support for the respective script. If the layout engine does not know how to lay out Arabic text, all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction applied. For example, browsers have quite good support by now for this kind of thing, but if I take the Arabic text and paste it into Word it will look wrong (was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same; it's just the display that's wrong.
Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial as far as layout is concerned. This is a recollection of how I think it might work and might not be actual fact.
Upvotes: 6
Reputation: 28999
The letters are stored in logical order; meaning that a sentence such as "Hello! Salaam!" is in fact stored with the letters in precisely that order.
In addition to that, however, certain Unicode flags are also added to the text that inform the text layout engine that the "Salaam" part of the sentence should be reversed when displayed, so the final text layout becomes "Hello! maalaS!", as well it should be.
These flags are set either through natural Bidi classification (e.g. غ, which is inherently RTL) or through use of the Unicode LTR and RTL marks, U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK).
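To check which directional mark is which, the standard unicodedata module can report their names and Bidi classes. This is just a property lookup; the variable names are mine.

```python
import unicodedata

LRM = "\u200e"
RLM = "\u200f"

marks = {
    ch: (unicodedata.name(ch), unicodedata.bidirectional(ch), unicodedata.category(ch))
    for ch in (LRM, RLM)
}

for ch, (name, bidi_class, category) in marks.items():
    print(f"U+{ord(ch):04X} {name} class={bidi_class} category={category}")

# The marks are invisible format characters, but they are real code points
# stored in the string alongside the visible text.
tagged = "abc" + RLM
print(len("abc"), len(tagged))
```

Both marks are zero-width, so they change nothing visible on their own; they only act as strong L or R characters that influence how neighbouring neutral characters are resolved.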
If you pay attention, the cursor doesn't in fact jump strangely; it always follows logical character order.
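The apparent jump can be reproduced with a deliberately naive model: keep the text in logical order, reverse each maximal run of strong RTL characters for display, and look at which screen column each logical index lands in. This is not the Unicode Bidi algorithm (it ignores neutrals, numbers, embedding levels, and combining marks) and it assumes an LTR paragraph; visual_order and the sample string are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}

def visual_order(logical):
    """Logical indices in left-to-right screen order (naive, LTR paragraph)."""
    order, run = [], []
    for i, ch in enumerate(logical):
        if unicodedata.bidirectional(ch) in RTL_CLASSES:
            run.append(i)                 # collect an RTL run
        else:
            order.extend(reversed(run))   # flush the run, reversed
            run = []
            order.append(i)
    order.extend(reversed(run))
    return order

# Latin "ab", space, Hebrew alef bet gimel, space, Latin "cd"
logical = "ab \u05d0\u05d1\u05d2 cd"
order = visual_order(logical)
visual = "".join(logical[i] for i in order)

# Screen column occupied by each logical character.
column_of = {logical_index: column for column, logical_index in enumerate(order)}
columns = [column_of[i] for i in range(len(logical))]
print(columns)
```

Stepping through logical indices 2, 3, 4, 5, 6 visits columns 2, 5, 4, 3, 6: the cursor advances one character at a time in storage order, but on screen it leaps to the far end of the RTL run, walks leftwards through it, and then leaps back.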
Upvotes: 3