Sage Mitchell

Reputation: 1583

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew, nor do I know enough about how the technologies involved represent right-to-left languages.

The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".

 string hebrewText = reader.ReadToEnd();
 Assert.AreEqual("מנבוצץז ", hebrewText);

The rasterized PDF has what I believe are the same characters, but in the opposite order.

[Image: screenshot of the rasterized PDF]

The unit test fails with this message:

Expected: "מנבוצץז "

But was: " זץצובנמ"

Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.

  1. Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
  2. Does any part of the .NET stack tamper with the direction of Hebrew strings?
  3. What about NUnit?
  4. Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
  5. Anything else I should know before deciding whether to "fix" this unit test?

Upvotes: 4

Views: 1653

Answers (2)

Allon Guralnek

Reputation: 16121

There are various ways to encode RTL languages. The most common way (and the Windows default) is logical ordering, which means the first letter of the text is stored as the first character in the string (or file). So whether the first letter appears visually on the left or the right side of the screen doesn't affect the order in which the letters are stored.

As for how the text appears in Visual Studio, that depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards. This became apparent when you tried to select Hebrew text: the selection reversed in an odd, visually confusing way. It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).

Let's take a Hebrew word whose direction is clearer to non-Hebrew speakers than the string in your test:

יון

The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
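A minimal sketch of that check, in case you want to see the storage order for yourself. The class and the `CodePoints` helper are hypothetical names introduced here for illustration; the string is written with escapes so the editor's bidirectional rendering can't mislead you.

```csharp
using System;
using System.Linq;

static class LogicalOrderDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX,
    // in the order the characters are stored (logical order).
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    static void Main()
    {
        // yod, vav, final nun: the same word as above, written with escapes.
        string ion = "\u05D9\u05D5\u05DF";

        Console.WriteLine(CodePoints(ion));                 // U+05D9 U+05D5 U+05DF
        Console.WriteLine(ion.Substring(0, 1) == "\u05D9"); // True
    }
}
```

Printing code points rather than the letters themselves sidesteps any question about how the console or IDE chooses to draw RTL text.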

Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or whether it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, so you should fix the test.
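Before changing the test, it may help to confirm that the two strings really differ only in storage order. The sketch below assumes the expected and actual values are exactly the ones in your failure message (written here as escapes); `DirectionCheck`, `ReverseChars`, and `IsReversalOf` are hypothetical names, not part of NUnit or your SDK.

```csharp
using System;

static class DirectionCheck
{
    // Reverses a string by UTF-16 code unit. Adequate for plain Hebrew letters;
    // text with combining marks or surrogate pairs would need StringInfo instead.
    public static string ReverseChars(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // True when 'actual' is exactly 'expected' stored back to front.
    public static bool IsReversalOf(string expected, string actual) =>
        string.Equals(ReverseChars(expected), actual, StringComparison.Ordinal);

    static void Main()
    {
        // Assumed to match the "Expected" and "But was" values from the question.
        string expected = "\u05DE\u05E0\u05D1\u05D5\u05E6\u05E5\u05D6 ";
        string actual = " \u05D6\u05E5\u05E6\u05D5\u05D1\u05E0\u05DE";

        Console.WriteLine(IsReversalOf(expected, actual)); // True
    }
}
```

This only shows that one value is the mirror of the other. Which of the two is in logical order still has to be decided against the PDF.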

Upvotes: 4

Alexander R

Reputation: 2476

  1. I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
  2. More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
  3. NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
  4. I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
  5. Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!

Upvotes: 1
