Correct Hebrew character sequence in C# and searchable PDFs

Question

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and the update is causing an existing test on Hebrew text to fail. I don't know Hebrew, nor do I know enough about how the technologies involved represent right-to-left languages.

The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".

 string hebrewText = reader.ReadToEnd();
 Assert.AreEqual("מנבוצץז ", hebrewText);

The rasterized PDF has what I believe are the same characters, but in the opposite order.

[Image: the Hebrew text as it appears in the rasterized PDF]

The unit test fails with this message:

Expected: "מנבוצץז "

But was: " זץצובנמ"

Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.

Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?

Allon Guralnek · Accepted Answer

There are various ways to encode RTL languages. The most common way (and the Windows default) is logical ordering, which means the first letter of the text is encoded as the first character in the string (or file). Whether the first letter appears visually on the left or the right side of the screen doesn't affect the order in which the letters are stored.

How the text appears in Visual Studio depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards. This was apparent when you tried to select Hebrew text, because the selection reversed in an odd, visually confusing way. The issue appears to no longer exist in Visual Studio 2010 (at least with SP1, which I just tested).

Let's take a Hebrew word whose direction is clearer to non-Hebrew speakers than the string in your test:

יון

The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
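To see the stored order without relying on how an editor or console renders Hebrew, you can print the code units instead of the glyphs. This is a minimal sketch; the class name LogicalOrderDemo and the helper CodePoints are made up for illustration, and the helper assumes every character is in the Basic Multilingual Plane (true for Hebrew letters), so one char equals one code point.

```csharp
using System;
using System.Linq;

public static class LogicalOrderDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX, in storage order.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        // yod, vav, final nun: the word from the paragraph above, in logical order.
        string ion = "\u05D9\u05D5\u05DF";

        Console.WriteLine(CodePoints(ion));                 // U+05D9 U+05D5 U+05DF
        Console.WriteLine(CodePoints(ion.Substring(0, 1))); // U+05D9
        Console.WriteLine(ion[0] == '\u05D9');              // True
    }
}
```

The output is the same regardless of whether the display shows the letters right-to-left, which is the point: indexing and Substring work on storage order, not visual order.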

Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or whether it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.
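Before changing the test, it may help to confirm that the two strings really differ only in direction. The sketch below writes the expected and actual values from your failure message as Unicode escapes and checks whether one is the character-by-character reverse of the other. The names ReversalCheck and ReverseChars are made up for illustration. Note that reversing code units is only a diagnostic and is not the Unicode Bidirectional Algorithm; it works here because the string holds only Hebrew letters and a space, with no digits, Latin text, or combining marks.

```csharp
using System;

public static class ReversalCheck
{
    // Hypothetical helper: reverses UTF-16 code units (diagnostic only, not bidi-aware).
    public static string ReverseChars(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    public static void Main()
    {
        // Expected value from the test: mem, nun, bet, vav, tsadi, final tsadi, zayin, space.
        string expected = "\u05DE\u05E0\u05D1\u05D5\u05E6\u05E5\u05D6 ";
        // Actual value from the failure message: the same characters, space first.
        string actual = " \u05D6\u05E5\u05E6\u05D5\u05D1\u05E0\u05DE";

        Console.WriteLine(ReverseChars(expected) == actual); // True
    }
}
```

If the check prints True, the only difference is the order, and the remaining question is which of the two matches the logical (reading) order of the text in the PDF.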
