Marek

Reputation: 10402

Space in a .NET string returned by string.Format does not match space declared in source code - multiple representations?

The string returned by string.Format seems to use some strange encoding. The space that appears inside the formatted number is represented by different byte values than a space typed into a string literal in source code.

The following test case demonstrates the problem:

[Test]
public void FormatSize_Regression() 
{
  string size1023 = FileHelper.FormatSize(1023);
  Assert.AreEqual("1 023 Bytes", size1023);
}

Fails:

    String lengths are both 11. Strings differ at index 1.
    Expected: "1 023 Bytes"
    But was:  "1 023 Bytes"
    ------------^

FormatSize method:

public static string FormatSize(long size) 
{
  if (size < 1024)
     return string.Format("{0:N0} Bytes", size);
  else if (size < 1024 * 1024)
     return string.Format("{0:N2} KB", (double)((double)size / 1024));
  else
     return string.Format("{0:N2} MB", (double)((double)size / (1024 * 1024)));
}

From VS immediate window when breakpoint is set on the Assert line:

size1023
"1 023 Bytes"

System.Text.Encoding.UTF8.GetBytes(size1023)
{byte[12]}
    [0]: 49
    [1]: 194 <--------- space is 194/160 here? Unicode bytes indicate that space should be the 160. What is the 194 then?
    [2]: 160
    [3]: 48
    [4]: 50
    [5]: 51
    [6]: 32
    [7]: 66
    [8]: 121
    [9]: 116
    [10]: 101
    [11]: 115
System.Text.Encoding.UTF8.GetBytes("1 023 Bytes")
{byte[11]}
    [0]: 49
    [1]: 32  <--------- space is 32 here
    [2]: 48
    [3]: 50
    [4]: 51
    [5]: 32
    [6]: 66
    [7]: 121
    [8]: 116
    [9]: 101
    [10]: 115

System.Text.Encoding.Unicode.GetBytes(size1023)
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 160 <----------- 160,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0
System.Text.Encoding.Unicode.GetBytes("1 023 Bytes")
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 32 <----------- 32,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0

Question: How is this possible?

Upvotes: 2

Views: 1947

Answers (6)

Konamiman

Reputation: 50273

Maybe you could change the expected string in the Assert.AreEqual call to use CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator instead of a single space character?
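A minimal sketch of that idea, assuming the culture groups digits in threes. `GroupSeparatorDemo` and `ExpectedFor1023` are hypothetical names used only for illustration; they are not part of the question's code.

```csharp
using System;
using System.Globalization;

public static class GroupSeparatorDemo
{
    // Hypothetical helper: builds the expected text for 1023 bytes from the
    // culture's group separator instead of a hard-coded space (U+0020).
    // Assumes the culture groups digits in threes.
    public static string ExpectedFor1023(CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        return "1" + separator + "023 Bytes";
    }

    public static void Main()
    {
        CultureInfo culture = CultureInfo.CurrentCulture;
        string actual = string.Format(culture, "{0:N0} Bytes", 1023);

        // Compares the formatted value with an expectation built from the same culture.
        Console.WriteLine(actual == ExpectedFor1023(culture));
    }
}
```

In the original test, the same expression could replace the literal "1 023 Bytes" passed to Assert.AreEqual.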

Upvotes: 2

Jonathan van de Veen

Reputation: 1016

First of all, all strings in .NET are Unicode (UTF-16 internally), so getting the UTF-8 bytes is not very useful. Second, when comparing strings you should specify a CultureInfo, and when using string.Format you should pass an IFormatProvider. That way you control which characters these functions use.
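As a rough sketch of the IFormatProvider suggestion, the method below mirrors the thresholds of the question's FormatSize but lets the caller choose the culture. The class name `FileHelperSketch` and the extra `provider` parameter are assumptions for illustration, not the asker's actual API.

```csharp
using System;
using System.Globalization;

public static class FileHelperSketch
{
    // Same thresholds as the question's FormatSize, but the culture is explicit.
    public static string FormatSize(long size, IFormatProvider provider)
    {
        if (size < 1024)
            return string.Format(provider, "{0:N0} Bytes", size);
        if (size < 1024 * 1024)
            return string.Format(provider, "{0:N2} KB", size / 1024.0);
        return string.Format(provider, "{0:N2} MB", size / (1024.0 * 1024.0));
    }

    public static void Main()
    {
        // With the invariant culture the group separator is always a comma.
        Console.WriteLine(FormatSize(1023, CultureInfo.InvariantCulture)); // 1,023 Bytes
        Console.WriteLine(FormatSize(1536, CultureInfo.InvariantCulture)); // 1.50 KB
    }
}
```

A test that passes CultureInfo.InvariantCulture this way no longer depends on the regional settings of the machine running it.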

Upvotes: 0

J. Steen

Reputation: 15578

160 is a non-breaking space, which sort of makes sense, because you wouldn't want your number to be split across lines. But 194... Oh yeah: UTF-8 encodes that character as two bytes.

Upvotes: 1

Eamon Nerbonne

Reputation: 48066

194, 160 is the UTF-8 encoding of code point 160 (U+00A0): the non-breaking space, &nbsp; in HTML.

That makes sense, you don't want a single number to be considered several words.

In short, your test revealed a flawed assumption - great! However, in terms of a unit test, your test has issues; you should always include a CultureInfo object when converting to and from strings - otherwise your unit tests may fail depending on the logged-in user's culture settings. You expect a particular form of string formatting - make sure you explicitly state which CultureInfo you're expecting.
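To see which character a given culture actually uses, one option is to print the separator's code points. This is only a sketch: `SeparatorInspector` and `DescribeSeparator` are hypothetical names, and what the current culture reports depends on the machine's regional settings and globalization data.

```csharp
using System;
using System.Globalization;

public static class SeparatorInspector
{
    // Renders each character of the culture's group separator as U+XXXX,
    // which makes invisible differences such as U+0020 vs U+00A0 visible.
    public static string DescribeSeparator(CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        string[] codes = new string[separator.Length];
        for (int i = 0; i < separator.Length; i++)
            codes[i] = "U+" + ((int)separator[i]).ToString("X4", CultureInfo.InvariantCulture);
        return string.Join(" ", codes);
    }

    public static void Main()
    {
        Console.WriteLine("invariant: " + DescribeSeparator(CultureInfo.InvariantCulture));
        Console.WriteLine("current:   " + DescribeSeparator(CultureInfo.CurrentCulture));
    }
}
```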

Upvotes: 2

Ruben

Reputation: 15515

Unicode code point 160 is not represented in UTF-8 by the single byte 160, but by two bytes. And without checking, I'd wager those to be 194 + 160.

In fact, UTF-8 represents any Unicode code point above 127 with more than one byte.

And I guess that your CultureInfo uses a non-breaking space (160) as the thousands grouping separator, and not a plain space (32) like the one you typed yourself.
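The 194 + 160 guess can be checked by hand. For code points from U+0080 to U+07FF, UTF-8 uses the bit pattern 110xxxxx 10xxxxxx. The sketch below (the names `Utf8Demo` and `EncodeTwoByte` are made up for illustration) applies that pattern manually and compares the result with Encoding.UTF8.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Manually encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    public static byte[] EncodeTwoByte(int codePoint)
    {
        if (codePoint < 0x80 || codePoint > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        byte lead = (byte)(0xC0 | (codePoint >> 6));    // 110xxxxx: upper 5 bits
        byte trail = (byte)(0x80 | (codePoint & 0x3F)); // 10xxxxxx: lower 6 bits
        return new[] { lead, trail };
    }

    public static void Main()
    {
        byte[] manual = EncodeTwoByte(0x00A0);
        byte[] framework = Encoding.UTF8.GetBytes("\u00A0");

        Console.WriteLine(string.Join(", ", manual));    // 194, 160
        Console.WriteLine(string.Join(", ", framework)); // 194, 160

        // A plain space (U+0020) is below 128, so it stays a single byte.
        Console.WriteLine(string.Join(", ", Encoding.UTF8.GetBytes(" "))); // 32
    }
}
```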

Upvotes: 4

Jon Skeet

Reputation: 1500565

I suspect your current culture is using an interesting "thousands" separator - U+00A0, which is the non-breaking space character. That's not an entirely unreasonable thousands separator, to be honest... it means you shouldn't get text like this displayed:

The size of the file is 1
023 bytes.

Instead you'd get

The size of the file is
1 023 bytes.

On my box, I get "1,023" instead. Do you want your FormatSize method to use the current culture, or a specific one? If it's the current culture, you should probably make your unit test specify the culture. I have a couple of wrapper methods I use for this:

internal static void WithInvariantCulture(Action action)
{
    WithCulture(CultureInfo.InvariantCulture, action);
}

internal static void WithCulture(CultureInfo culture, Action action)
{
    CultureInfo original = Thread.CurrentThread.CurrentCulture;
    try
    {
        Thread.CurrentThread.CurrentCulture = culture;
        action();
    }
    finally
    {
        Thread.CurrentThread.CurrentCulture = original;
    }            
}

so I can run:

WithInvariantCulture(() =>
{
    // Body of test
});

etc.

If you want to test for the exact string you're getting, you can use:

Assert.AreEqual("1\u00A0023 Bytes", size1023);

Upvotes: 13
