Reputation: 6046
Whereas most of the Unix/POSIX/etc world uses UTF-8 for text representation, Windows uses UTF-16LE.
Why is that? There are multiple folks who say the Windows APIs were written before UTF-8 (and even Unicode as we know it) existed (1, 2, 3), so UTF-16 (or even earlier, UCS-2) was the best they had, and that converting the existing APIs to UTF-8 would be a ridiculous amount of work.
But are there any official sources for these two claims? The official MSDN page for Unicode makes it seem like UTF-16 may even be desirable (though I myself don't agree):
These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems.
Is there any official note (or an engineer who worked on the project) explaining the reasoning behind choosing UTF-16 and why Windows would/would not switch to UTF-8?
Disclaimer: I work for Microsoft.
Upvotes: 19
Views: 11093
Reputation: 6046
Raymond Chen actually has an "official" answer—or at least an answer from a Microsoft source (emphasis added):
Windows adopted Unicode before most other operating systems.[citation needed] As a result, Windows’s solutions to many problems differ from solutions adopted by those who waited for the dust to settle.¹ The most notable example of this is that Windows used UCS-2 as the Unicode encoding. This was the encoding recommended by the Unicode Consortium because Unicode 1.0 supported only 65536 characters.² The Unicode Consortium changed their minds five years later, but by then it was far too late for Windows, which had already shipped Win32s, Windows NT 3.1, Windows NT 3.5, Windows NT 3.51, and Windows 95, all of which used UCS-2.³
— The sad history of Unicode printf-style format specifiers in Visual C++
In other words, Remy Lebeau and AmigoJack were both right—Windows adopted Unicode before UTF-8 was recommended (or even existed?); at the time, UCS-2 was the standard, so that's what Windows chose.
By the time the more efficient (and now more common) UTF-8 standard was developed, Windows had already shipped several versions, and it would have been immensely impractical (if not impossible) to change.
Thanks to everyone who provided answers to this question! Since I was looking for an official source, I'm marking this as the answer (although I'm marking it as community wiki, since it is an amalgamation).
Upvotes: 12
Reputation: 595402
Windows was one of the first Operating Systems to adopt Unicode. Back then, there was indeed no UTF-8 yet, and UCS-2 was the most common encoding used for Unicode. So Windows' initial Unicode support was based on UCS-2.
By the time Unicode outgrew UCS-2, and UTF-8 and UTF-16 became more popular, it was too late for Windows to change over to UTF-8 without breaking tons of existing code 1. However, UTF-16 is backwards compatible with UCS-2, so Microsoft was able to switch to UTF-16 with minimal effort and little to no changes to existing user code.
1: And now, 20-odd years later, in Windows 10, Microsoft is only just beginning to support UTF-8 at the Win32 API layer. That functionality is still experimental, has to be enabled manually by the user or on a per-application basis via app manifests, and typically requires changes to user code to take advantage of UTF-8-enabled APIs rather than UTF-16-based APIs.
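To make the footnote concrete: until a program opts into the UTF-8 code page, the usual way to feed UTF-8 text to Win32 is to convert it to UTF-16 at the API boundary. The following is a minimal C sketch of that conversion, not official Microsoft sample code; it only relies on the documented MultiByteToWideChar() and MessageBoxW() APIs, and the sample string and error handling are illustrative.

/* Minimal sketch: converting UTF-8 text to UTF-16 at the API boundary so it
   can be passed to a "W" (UTF-16) Win32 function. Error handling is kept to
   the bare minimum for brevity. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *utf8 = "UTF-8 text: caf\xC3\xA9";   /* "café" as raw UTF-8 bytes */

    /* First call asks for the required buffer size in WCHARs (incl. the NUL). */
    int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, NULL, 0);
    if (needed == 0) {
        fprintf(stderr, "conversion failed: %lu\n", GetLastError());
        return 1;
    }

    WCHAR *utf16 = (WCHAR *)malloc(needed * sizeof(WCHAR));
    if (!utf16) return 1;

    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, utf16, needed);

    /* The converted text can now go to any UTF-16-based API. */
    MessageBoxW(NULL, utf16, L"UTF-8 to UTF-16", MB_OK);

    free(utf16);
    return 0;
}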
Upvotes: 20
Reputation: 6099
By "world" you most likely mean everything: operating system (internally used encoding), executables (supported encodings), file formats (supported encodings), file systems (internally used encodings) and more.
Windows won't easily switch because, for example, essential file formats such as PE (used in EXE, DLL and whatnot) have resource strings that can only cope with codepoints stored in WORDs. The format is already a patch on a patch on a patch, and adding yet another extension to it may be more annoying than just using binary resource blocks and casting them to UTF-8.
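As a concrete illustration of those WORD-sized resource strings: the classic STRINGTABLE resource stores its entries as a WORD character count followed by that many UTF-16 code units, which is why raw UTF-8 doesn't fit without a format change. The sketch below walks such a block; DumpStringBlock is a hypothetical helper, and the block pointer is assumed to come from the usual FindResource/LoadResource/LockResource sequence.

/* Sketch of the string-table layout: a block holds 16 strings, each stored as
   a WORD count followed by 'count' UTF-16 code units (no NUL terminator).
   Every unit is a WORD, so the format has no room for raw UTF-8 bytes. */
#include <windows.h>
#include <stdio.h>

void DumpStringBlock(const BYTE *block, DWORD blockSize)
{
    const BYTE *p   = block;
    const BYTE *end = block + blockSize;

    for (int i = 0; i < 16 && p + sizeof(WORD) <= end; i++)
    {
        WORD count = *(const WORD *)p;                  /* length in UTF-16 units */
        const WCHAR *text = (const WCHAR *)(p + sizeof(WORD));

        wprintf(L"string %d: %.*ls\n", i, (int)count, text);

        p += sizeof(WORD) + count * sizeof(WCHAR);      /* step to the next entry */
    }
}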
Since introducing Unicode, the Windows API has been laid out as one WORD per character; most ANSI versions of each function are only stubs that call the WIDE version of that function. UTF-8 can't simply be forced into that layout without breaking all legacy code - a whole new API would be needed (or a third version of each function). Only a few functions are "future ready" because you can tell them which encoding the text comes in (most obviously MultiByteToWideChar()).
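A rough sketch of that "ANSI stub calls the WIDE version" pattern, in C: the hypothetical MyCreateFileA below is only an illustration of the idea, not the actual Windows implementation. It converts its text argument using the process's ANSI code page and forwards to CreateFileW; only when that code page is set to UTF-8 (an opt-in on recent Windows 10) would the narrow entry point actually accept UTF-8.

/* Illustration of the A-stub-calls-W pattern. This is a hypothetical
   re-creation for explanation only, not the real Windows source. */
#include <windows.h>

HANDLE MyCreateFileA(const char *fileName, DWORD access, DWORD share,
                     LPSECURITY_ATTRIBUTES sa, DWORD disposition,
                     DWORD flags, HANDLE templateFile)
{
    WCHAR wide[MAX_PATH];

    /* CP_ACP means "the process's ANSI code page"; the text is interpreted in
       that encoding, which is exactly why UTF-8 can't just be forced onto the
       existing A functions. */
    if (MultiByteToWideChar(CP_ACP, 0, fileName, -1, wide, MAX_PATH) == 0)
    {
        SetLastError(ERROR_INVALID_PARAMETER);
        return INVALID_HANDLE_VALUE;
    }

    return CreateFileW(wide, access, share, sa, disposition, flags, templateFile);
}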
NTFS stores every character in WORDs, too (thus indirectly supporting UTF-16), and I can't see that changing with just a new version of it - I'd rather bet that a whole new file system will be introduced that obsoletes NTFS, with at least the new feature of also storing all filenames in UTF-8.
Upvotes: 2