Reputation: 62145
Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
Like many, I read it carefully, realizing it was high time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?
I don't write many specifically international applications, but I have helped build many internet-facing ASP.NET websites, so I guess that's no excuse.
So for my benefit (and I believe many others) can I get some input from people on the following:
I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.
Update: See this related question also asked on StackOverflow previously.
Upvotes: 12
Views: 905
Reputation: 25408
The .NET Framework stores strings in memory as UTF-16, which is also the encoding Windows uses natively. If you don't specify an encoding when you use most text I/O classes, you will write UTF-8 with no BOM and read by first checking for a BOM, then assuming UTF-8 (I know for sure StreamReader and StreamWriter behave this way). This is pretty safe for "dumb" text editors that won't understand a BOM, but kind of cruddy for smarter ones that could display UTF-8, or for the situation where you're actually writing characters outside the standard ASCII range.
Normally this is invisible, but it can rear its head in interesting ways. Yesterday I was working with someone who was using XML serialization to serialize an object to a string using a StringWriter, and he couldn't figure out why the encoding was always UTF-16. Since a string in memory is going to be UTF-16 and that is enforced by .NET, that's the only thing the XML serialization framework could do.
So, when I'm writing something that isn't just a throwaway tool, I specify a UTF-8 encoding with a BOM. Technically in .NET you will always be accidentally Unicode aware, but only if your user knows to detect your encoding as UTF-8.
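To make the BOM point concrete, here is a minimal sketch in Python rather than C#, purely for illustration. The sample string and the read_text helper are my own assumptions, not anything from .NET. Python's "utf-8" codec writes no BOM and "utf-8-sig" writes one, and on reading "utf-8-sig" strips a BOM if present and otherwise decodes as plain UTF-8, which roughly mirrors the "check for a BOM, then assume UTF-8" behaviour described above.

```python
import codecs

text = "naïve café"  # sample text, chosen only for illustration

# "utf-8" writes no byte order mark; "utf-8-sig" prepends one.
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

print(with_bom.startswith(codecs.BOM_UTF8))     # True
print(without_bom.startswith(codecs.BOM_UTF8))  # False


def read_text(data: bytes) -> str:
    """Hypothetical reader: drop a leading BOM if there is one, then decode as UTF-8."""
    return data.decode("utf-8-sig")


# Both forms decode back to the same string.
print(read_text(with_bom) == read_text(without_bom) == text)  # True
```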
It makes me cry a little every time I see someone ask, "How do I get the bytes of a string?" and the suggested solution uses Encoding.ASCII.GetBytes() :(
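Why that hurts, as a rough Python sketch (again not C#, and the sample string is made up). .NET's ASCII encoder substitutes "?" for anything it cannot represent; Python raises by default, so errors="replace" is passed here to imitate the silent substitution.

```python
text = "café ☕"  # sample text with two non-ASCII characters

# ASCII cannot represent é or ☕; with errors="replace" each becomes "?".
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)  # b'caf? ?'

# The original text cannot be recovered from those bytes.
print(ascii_bytes.decode("ascii") == text)  # False

# UTF-8 round-trips without loss.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8") == text)  # True
```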
Upvotes: 3
Reputation: 81132
Rule of thumb: if you never munge or look inside a string and instead treat it strictly as a blob of data, you'll be much better off.
Even doing something as simple as splitting words or lowercasing strings becomes tough if you want to do it "the Unicode way".
And if you want to do it "the Unicode way", you'll need an awfully good library. This stuff is incredibly complex.
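A couple of small examples of how "simple" string operations get tricky, sketched in Python using only its standard library. The sample words are my own choice; a real application would likely need locale-aware rules on top of this.

```python
import unicodedata

# Uppercasing can change the length: the German sharp s becomes two letters.
print("straße".upper())  # STRASSE

# lower() is not enough for caseless comparison; casefold() is meant for it.
print("STRASSE".lower() == "straße".lower())        # False
print("STRASSE".casefold() == "straße".casefold())  # True

# The same visible character can be spelled with different code points.
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```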
Upvotes: 2
Reputation: 18077
Since I read the Joel article and some other I18n articles, I have always kept a close eye on my character encoding, and it actually works if you do it consistently. If you work in a company where it is standard to use UTF-8, and everybody knows this and does this, it will work.
Here are some interesting articles (besides Joel's article) on the subject:
A quote from the first article; Tips for using Unicode:
Upvotes: 9
Reputation: 118063
I spent a while working with search engine software. You wouldn't believe how many web sites serve up content with HTTP headers or meta tags which lie about the encoding of the pages. Often, you'll even get a document which contains both ISO-8859 characters and UTF-8 characters.
Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
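For what it's worth, one defensive pattern for pages with untrustworthy labels can be sketched like this in Python. The decode_page function is a hypothetical helper of mine, not from any library, and it is only a heuristic: it tries strict UTF-8 first (invalid UTF-8 fails loudly, so a success is decent evidence), then the declared encoding, then ISO-8859-1 as a last resort. It does nothing for documents that mix encodings.

```python
def decode_page(data: bytes, declared: str):
    """Hypothetical helper: return (text, encoding_used) for a fetched page."""
    for encoding in ("utf-8", declared):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding; try the next candidate.
            continue
    # ISO-8859-1 maps every byte to a code point, so this never fails,
    # although the resulting text may still be wrong.
    return data.decode("iso-8859-1"), "iso-8859-1"


# A UTF-8 page whose header claims ISO-8859-1.
mislabelled = "café".encode("utf-8")
print(decode_page(mislabelled, "iso-8859-1"))  # ('café', 'utf-8')

# An ISO-8859-1 page whose header claims UTF-8.
legacy = "café".encode("iso-8859-1")
print(decode_page(legacy, "utf-8"))  # ('café', 'iso-8859-1')
```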
Upvotes: 4