TripeHound

Reputation: 2960

Reading UTF8-encoded command-line in .NET (C#)

I want to be able to handle UTF8-encoded command-line parameters in a .NET console program I'm writing. Unfortunately, both the "args" array passed to the Main() function and the Environment class members (CommandLine and GetCommandLineArgs()) have already been (incorrectly) converted into Unicode, seemingly by treating the command-line as single-byte extended ASCII.

For example, U+2019 (closing single apostrophe) in UTF8 is 0xe2 0x80 0x99. In the 1252 code page, 0x80 is the Euro symbol (U+20ac) and 0x99 is the "TM" symbol (U+2122). (The 0xe2 is "â", an "a" with circumflex, which is U+00e2, so it doesn't change.) When I pass these three bytes in on the command-line, the "char" elements of the string are 0x00e2, 0x20ac and 0x2122.

Is there a way of either telling .NET to interpret the command-line as UTF8, or of getting the raw, unprocessed command-line (which I can happily convert to a Unicode string)?
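
For illustration, here is a minimal sketch (not part of the real program; the class name is illustrative) that reproduces this mangling by decoding the UTF8 bytes as code page 1252:

    using System;
    using System.Text;

    class ManglingDemo
    {
        static void Main()
        {
            // The three bytes of U+2019 in UTF8.
            byte[] utf8Bytes = { 0xE2, 0x80, 0x99 };

            // Decoding them as code page 1252 reproduces the mangling:
            // 0xE2 -> U+00E2, 0x80 -> U+20AC, 0x99 -> U+2122.
            string mangled = Encoding.GetEncoding(1252).GetString(utf8Bytes);

            foreach (char c in mangled)
                Console.Write("{0:x4} ", (int)c);   // prints: 00e2 20ac 2122
        }
    }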

Update

(Following dletozeun's answer)

Windows does odd things -- especially if it's XP (which I was using when I first asked the question). Things seem to behave differently depending on whether you call the .NET command-line program from a batch file or directly from a command prompt. There's possibly a very good reason™ for this, but I don't know it. Anyway, should it help anyone, here's what I've found:

Command Line

Opening a standard Command Prompt window and entering the following command:

    UTF8Cmd.exe abc’def

where UTF8Cmd is a test program incorporating dletozeun's solution, and the middle three characters are the bytes 0xe2, 0x80, 0x99 (the UTF8 encoding of U+2019, the closing single apostrophe), produces the following output (showing the argument before and after dletozeun's code, both as a string and dumped in hex):

    Raw : "abcâ?Tdef"    61 62 63 e2 20ac 2122 64 65 66
    UTF8: "abc'def"      61 62 63 2019 64 65 66

showing that the original argument (Raw) has been mangled: the raw UTF8 bytes have been mapped through the 1252 code page to their Unicode equivalents, but the code posted has converted them back to the correct value (U+2019).
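
UTF8Cmd's actual source isn't shown here, but a dump routine along the following lines (a sketch, with illustrative names) produces output in the format above:

    using System;
    using System.Text;

    class UTF8Cmd
    {
        static void Main(string[] rawArgs)
        {
            string raw = String.Join(" ", rawArgs);
            Dump("Raw ", raw);

            // dletozeun's conversion (see the answer below).
            byte[] bytes = Encoding.Default.GetBytes(raw);
            Dump("UTF8", Encoding.UTF8.GetString(bytes));
        }

        // Print a string both as text and as the hex values of its chars.
        static void Dump(string label, string s)
        {
            Console.Write("{0}: \"{1}\"    ", label, s);
            foreach (char c in s)
                Console.Write("{0:x} ", (int)c);
            Console.WriteLine();
        }
    }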

Batch File

Unfortunately, putting just the above into a batch file does not work... a completely different mangling happens, producing:

    Raw : "abcÔÇÖdef"    61 62 63 d4 c7 d6 64 65 66
    UTF8: "abc???def"    61 62 63 fffd fffd fffd 64 65 66

The raw bytes have been mangled differently this time: cmd.exe reads batch files using the OEM code page (850 here), in which 0xe2, 0x80 and 0x99 are "Ô", "Ç" and "Ö" (U+00d4, U+00c7, U+00d6). Re-encoding those characters under code page 1252 gives 0xd4 0xc7 0xd6, which is not valid UTF8, hence the fffds (the U+FFFD replacement character) after processing.

However, @mvp's suggestion of using chcp 65001 first (and resetting it afterwards) now does make things work without needing dletozeun's code:

    Active code page: 65001
        Raw : "abc’def"      61 62 63 2019 64 65 66
        UTF8: "abc�def"      61 62 63 fffd 64 65 66
    Active code page: 850

I had tried this before, as noted in my comment below, but that was on an XP box, where it totally fails (it doesn't even appear to run the command, and it leaves the command prompt in a weird state). Trying it again just now, in response to the answer, on a Windows 7 box, the chcp 65001 command works as I hoped it would when I originally asked the question!
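
In other words, the batch file is just the test command wrapped in the chcp calls, something like:

    chcp 65001
    UTF8Cmd.exe abc’def
    chcp 850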

Upvotes: 2

Views: 3209

Answers (1)

dletozeun

Reputation: 138

I know it is late, but I also just ran into this problem and did not find an answer anywhere. I managed to find a solution, so here is what I did to handle UTF8-encoded characters in the argument list:

    // Handle UTF8-encoded characters: re-encode the (mangled) command line
    // back to bytes using the system's default ANSI code page, then decode
    // those bytes as UTF8 to recover the intended characters.
    byte[] argBytes = System.Text.Encoding.Default.GetBytes( System.String.Join( " ", System.Environment.GetCommandLineArgs() ) );
    string argString = System.Text.Encoding.UTF8.GetString( argBytes );
    // Note: splitting on spaces will break quoted arguments that contain spaces.
    string[] args = argString.Split( ' ' );
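
Wrapped into a complete program (a minimal sketch; class and variable names are illustrative), this might look like:

    using System;
    using System.Text;

    class Program
    {
        static void Main()
        {
            // Re-encode the mangled command line with the ANSI code page,
            // then decode the resulting bytes as UTF8.
            byte[] argBytes = Encoding.Default.GetBytes(
                String.Join(" ", Environment.GetCommandLineArgs()));
            string[] args = Encoding.UTF8.GetString(argBytes).Split(' ');

            // args[0] is the executable path; the rest are the recovered arguments.
            for (int i = 1; i < args.Length; i++)
                Console.WriteLine(args[i]);
        }
    }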

Upvotes: 3
