Reputation: 38462
Let me preface this by saying: I am by no means a Windows programmer, so please correct any misunderstandings I may have.
My understanding is that Windows has both (legacy) single-byte string interfaces and modernized Unicode interfaces.
My goal is to closely examine cp1252 as implemented in the Windows kernel. I'll start with Windows XP, but I plan to check as many versions as I can.
I'm going to make the output of such a program similar in format to: https://encoding.spec.whatwg.org/index-windows-1252.txt
My primary question: which Windows API functions would I use to accomplish this task? My best guess is mbstowcs_s.
Secondarily: must I write C in order to examine the relevant interfaces? If so, which compiler should I use? I think Visual Studio Express 2010 would be a good match, but I can't find any (legitimate) place to download it.
For those who must know the X to my Y: there are two competing specifications of cp1252, with matching implementations. They differ only slightly, but they do differ, and the difference is significant to me.
The WHATWG specifies, and all browsers implement, this standard: https://encoding.spec.whatwg.org/index-windows-1252.txt
Microsoft specifies, and Python implements, this standard: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
The difference is in five bytes (81, 8D, 8F, 90, and 9D). In the Microsoft spec they are entirely undefined, so those bytes cannot round-trip through cp1252. In the WHATWG spec (and in all browsers), they map to non-printing control characters of the same value, as in latin-1, so they round-trip through cp1252 successfully.
I strongly suspect that Microsoft's implementation actually matches the WHATWG spec and browsers' implementations, rather than the spec they've published. This is what I'm trying to prove/disprove above.
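For example, Python's built-in cp1252 codec follows the Microsoft table, so the disputed bytes refuse to round-trip (a quick demonstration, runnable on any platform):

# Python's cp1252 codec implements the Microsoft mapping: the five
# "undefined" bytes raise instead of decoding to control characters.
for b in (0x80, 0x81, 0x8D, 0x8F, 0x90, 0x9D):
    raw = bytes(bytearray([b]))  # works in Python 2 and 3
    try:
        print('%02X -> U+%04X' % (b, ord(raw.decode('cp1252'))))
    except UnicodeDecodeError:
        print('%02X -> undefined (no round-trip)' % b)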
Upvotes: 1
Views: 622
Reputation: 365697
Your question doesn't really make any sense. You want to examine "the encoding" used by each version of Windows from 95 through 10.
But none of those versions of Windows have "an encoding". Every single one of them is configurable in the same way: it has a default system encoding, which is pre-configured by Microsoft, and a current user encoding, which is set by Microsoft or the system OEM but which the user can change. So, your test won't depend on Windows 95 vs. Windows 7, it'll depend on US Windows 95 from Microsoft with default settings vs. ES Windows 95 from Microsoft with default settings vs. US Windows 95 from HP with default settings vs. US Windows 95 from Microsoft with each of the 238 possible choices in the Control Panel etc.
Also, to generate the kind of file you're trying to generate, you don't need to touch any Win32 APIs. All you need is any function that uses the configured system locale's character set to decode single-byte/multi-byte text to UTF-16/Unicode text. For example, from C, you can call one of the mbstowcs family from the MSVCRT; from Python, you can call the decode method on a str (Python 2)/bytes (Python 3) object with the 'mbcs' codec, which decodes via the configured ANSI code page; etc.
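A minimal sketch of the Python route (Windows-only; it assumes the configured ANSI code page is 1252):

# Decode every byte in the disputed range with the system ANSI code page.
for b in range(0x80, 0xA0):
    u = bytes(bytearray([b])).decode('mbcs')  # 'mbcs' = configured ANSI code page
    print('%02X -> U+%04X' % (b, ord(u)))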
If you really want to use the system interfaces to test the same information, you can… but then you'll run into limitations of most of those interfaces. For example, you can call CreateFileA to create a new file with an 8-bit name, then call CreateFileW to open the same file with the corresponding 16-bit name and verify that it works… but then you can't test any of the characters that are illegal in filenames.
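Here's a rough ctypes sketch of that round-trip. Note the hedge: it uses byte 9F, which both specs map to U+0178, because the five disputed bytes all decode to control characters, which are themselves illegal in filenames:

from ctypes import windll

k = windll.kernel32
GENERIC_WRITE, GENERIC_READ = 0x40000000, 0x80000000
CREATE_ALWAYS, OPEN_EXISTING = 2, 3
INVALID_HANDLE_VALUE = -1

# Create a file whose 8-bit name contains byte 0x9F...
h = k.CreateFileA(b'cp1252_\x9f.tmp', GENERIC_WRITE, 0, None,
                  CREATE_ALWAYS, 0, None)
assert h != INVALID_HANDLE_VALUE
k.CloseHandle(h)
# ...then open it by the 16-bit name cp1252 says it should have (U+0178).
h = k.CreateFileW(u'cp1252_\u0178.tmp', GENERIC_READ, 0, None,
                  OPEN_EXISTING, 0, None)
print('round-trip ok' if h != INVALID_HANDLE_VALUE else 'mismatch')
if h != INVALID_HANDLE_VALUE:
    k.CloseHandle(h)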
Finally, Microsoft has provided free C compilers for most if not all of those platforms, but some of them are long out of service, so I don't know whether you can still (legally) get them. But you can always use MinGW to set up a gcc-based toolchain. I don't know if the current versions still work on Win95, but if not, the old versions should still be available.
Upvotes: 1
Reputation: 38462
Using @abernert's help, I came up with the script below. In conclusion, Microsoft's published spec doesn't match their implementation, just as I suspected:
from ctypes import cdll, windll, c_char_p, create_string_buffer
c = cdll.msvcrt
k = windll.kernel32
LC_ALL = 0  # from locale.h
# reference: https://msdn.microsoft.com/en-US/library/x99tb11d.aspx
c.setlocale.restype = c_char_p
result = c.setlocale(LC_ALL, '.1252')
# note: this assertion assumes a US-English system; the locale name differs elsewhere
assert result == 'English_United States.1252', result
# cp1252 is classified as "multi-byte" by the Microsoft API, along with utf-8
mb = create_string_buffer(1)   # one cp1252 byte
wc1 = create_string_buffer(2)  # one UTF-16 code unit, from the CRT
wc2 = create_string_buffer(2)  # one UTF-16 code unit, from the kernel
print 'IN | MSVC KERN'
print '---+-----------'
for b in range(0x80, 0xA0):
    mb.value = chr(b)
    # reference: https://msdn.microsoft.com/en-us/library/yk02bkxb.aspx
    result = c.mbtowc(wc1, mb, 1)
    assert result == 1, result
    # reference:
    # https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072.aspx
    result = k.MultiByteToWideChar(1252, 0, mb, 1, wc2, 1)
    assert result == 1, result
    print '%02X | %02X%02X %02X%02X' % (
        ord(mb.value),
        # the buffers are little-endian, so print the high byte first:
        ord(wc1.raw[1]), ord(wc1.raw[0]),
        ord(wc2.raw[1]), ord(wc2.raw[0]),
    )
Output (tested on Windows XP, Vista, 7, and 8.1):
IN | MSVC KERN
---+-----------
80 | 20AC 20AC
81 | 0081 0081
82 | 201A 201A
83 | 0192 0192
84 | 201E 201E
85 | 2026 2026
86 | 2020 2020
87 | 2021 2021
88 | 02C6 02C6
89 | 2030 2030
8A | 0160 0160
8B | 2039 2039
8C | 0152 0152
8D | 008D 008D
8E | 017D 017D
8F | 008F 008F
90 | 0090 0090
91 | 2018 2018
92 | 2019 2019
93 | 201C 201C
94 | 201D 201D
95 | 2022 2022
96 | 2013 2013
97 | 2014 2014
98 | 02DC 02DC
99 | 2122 2122
9A | 0161 0161
9B | 203A 203A
9C | 0153 0153
9D | 009D 009D
9E | 017E 017E
9F | 0178 0178
Compare this with the spec that Microsoft registered with unicode.org:
0x80 0x20AC #EURO SIGN
0x81 #UNDEFINED
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C 0x0152 #LATIN CAPITAL LIGATURE OE
0x8D #UNDEFINED
0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON
0x8F #UNDEFINED
0x90 #UNDEFINED
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201C #LEFT DOUBLE QUOTATION MARK
0x94 0x201D #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02DC #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C 0x0153 #LATIN SMALL LIGATURE OE
0x9D #UNDEFINED
0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON
0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS
It's clear to me that the slots labeled UNDEFINED (bytes 81, 8D, 8F, 90, and 9D) are not undefined and are not errors: they decode to unprintable characters of equal ordinal, exactly as in the WHATWG spec, below:
0 0x20AC € (EURO SIGN)
1 0x0081 (<control>)
2 0x201A ‚ (SINGLE LOW-9 QUOTATION MARK)
3 0x0192 ƒ (LATIN SMALL LETTER F WITH HOOK)
4 0x201E „ (DOUBLE LOW-9 QUOTATION MARK)
5 0x2026 … (HORIZONTAL ELLIPSIS)
6 0x2020 † (DAGGER)
7 0x2021 ‡ (DOUBLE DAGGER)
8 0x02C6 ˆ (MODIFIER LETTER CIRCUMFLEX ACCENT)
9 0x2030 ‰ (PER MILLE SIGN)
10 0x0160 Š (LATIN CAPITAL LETTER S WITH CARON)
11 0x2039 ‹ (SINGLE LEFT-POINTING ANGLE QUOTATION MARK)
12 0x0152 Œ (LATIN CAPITAL LIGATURE OE)
13 0x008D (<control>)
14 0x017D Ž (LATIN CAPITAL LETTER Z WITH CARON)
15 0x008F (<control>)
16 0x0090 (<control>)
17 0x2018 ‘ (LEFT SINGLE QUOTATION MARK)
18 0x2019 ’ (RIGHT SINGLE QUOTATION MARK)
19 0x201C “ (LEFT DOUBLE QUOTATION MARK)
20 0x201D ” (RIGHT DOUBLE QUOTATION MARK)
21 0x2022 • (BULLET)
22 0x2013 – (EN DASH)
23 0x2014 — (EM DASH)
24 0x02DC ˜ (SMALL TILDE)
25 0x2122 ™ (TRADE MARK SIGN)
26 0x0161 š (LATIN SMALL LETTER S WITH CARON)
27 0x203A › (SINGLE RIGHT-POINTING ANGLE QUOTATION MARK)
28 0x0153 œ (LATIN SMALL LIGATURE OE)
29 0x009D (<control>)
30 0x017E ž (LATIN SMALL LETTER Z WITH CARON)
31 0x0178 Ÿ (LATIN CAPITAL LETTER Y WITH DIAERESIS)
Upvotes: 2
Reputation: 365697
To answer your X question instead of your Y question:
You can't really ask how "Windows" handles what it calls "ANSI strings", because there are multiple different levels that handle them independently. It's a pretty good bet that they all do so in ways that are compatible… but your whole point is to avoid that pretty good bet and examine the truth directly.
I think you can safely assume that MultiByteToWideChar will give you the same results as calling the SpamA vs. SpamW functions in the Win32 API. (If you can't even assume that, you'd really need to test every single function pair in the API to make sure they all give the same results…) You can pass 1252 directly as the code page, but I think passing CP_ACP (the configured ANSI code page) on a system configured for 1252 is a better test of what you're asking. Or just do both.
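A sketch of that double test via ctypes (CP_ACP is literally 0; this assumes the machine's ANSI code page really is 1252):

from ctypes import windll, create_string_buffer, create_unicode_buffer

k = windll.kernel32
CP_ACP = 0  # "the configured ANSI code page"
src = create_string_buffer(b'\x81')  # one of the disputed bytes
dst = create_unicode_buffer(2)
for cp in (CP_ACP, 1252):
    n = k.MultiByteToWideChar(cp, 0, src, 1, dst, 2)
    assert n == 1, n
    print('codepage %4d: 81 -> U+%04X' % (cp, ord(dst[0])))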
It's plausible that the MSVCRT (which provides the 8-bit-string-based standard C interface, and large chunks of POSIX, to portable programs, including CPython) has its own conversions. To verify that, call mbstowcs or one of its relatives.
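For example (a sketch; note the CRT's locale must be pointed at a .1252 locale first):

from ctypes import cdll, c_char_p, create_string_buffer, create_unicode_buffer

c = cdll.msvcrt
c.setlocale.restype = c_char_p
c.setlocale(0, b'.1252')  # LC_ALL = 0, from locale.h

src = create_string_buffer(b'\x81')
dst = create_unicode_buffer(2)
# size_t mbstowcs(wchar_t *dst, const char *src, size_t count)
n = c.mbstowcs(dst, src, 1)
assert n == 1, n
print('MSVCRT: 81 -> U+%04X' % ord(dst[0]))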
I'm pretty sure the Win32 system layer handles ANSI strings the same way as the user layer, but you may want to search for an undocumented ZwMultiByteToWideChar or similar. And I think the kernel itself just doesn't handle ANSI strings anywhere; e.g., IIRC, when you write a filesystem driver, the only pathname interfaces are wide. But you may want to download the DDK and make sure I'm right about that.
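If you do want to poke at that layer from user mode, ntdll exports RtlMultiByteToUnicodeN, which converts through the NLS tables for the system ANSI code page. A hedged sketch, assuming the signature documented in the DDK:

from ctypes import windll, byref, c_ulong, create_string_buffer, create_unicode_buffer

nt = windll.ntdll
src = create_string_buffer(b'\x81')
dst = create_unicode_buffer(2)
written = c_ulong(0)
# NTSTATUS RtlMultiByteToUnicodeN(PWCH dst, ULONG dstBytes,
#                                 PULONG bytesWritten, PCSTR src, ULONG srcBytes)
status = nt.RtlMultiByteToUnicodeN(dst, 4, byref(written), src, 1)
assert status == 0, hex(status & 0xFFFFFFFF)
print('ntdll: 81 -> U+%04X' % ord(dst[0]))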
I think the Explorer GUI shell relies on the Win32 layer to handle everything and doesn't touch ANSI strings anywhere. The cmd.exe command-line shell only deals in Unicode (except when running DOS programs on Win9x). But it's also a terminal, and as a terminal it really does deal with both ANSI and Unicode strings and map between them: you can send either ANSI or Unicode console output, and read either ANSI or Unicode console input. That's probably done via MultiByteToWideChar and friends, but I couldn't promise that. I think MSVCRT's stdin/stdout, its wide counterparts, and its DOS-conio-style getch/getwch families just call the respective console APIs instead of translating inside MSVCRT, but if you don't trust that, you can go around it and either get the native console streams or call the Console I/O functions directly.
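One way to poke at the console path directly (a sketch; it assumes a real console window, and it switches the console output code page to 1252, since consoles default to the OEM page):

from ctypes import windll, byref, c_ulong

k = windll.kernel32
k.SetConsoleOutputCP(1252)  # consoles default to the OEM code page
h = k.GetStdHandle(-11)     # STD_OUTPUT_HANDLE
written = c_ulong(0)
# If ANSI console output really goes through the cp1252 table,
# byte 0x80 should render as a euro sign.
k.WriteConsoleA(h, b'\x80\r\n', 3, byref(written), None)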
So, how do you write a test program for all of these things without hunting down multiple out-of-support versions of the Microsoft C++ compiler and an SDK for each OS? (And even if you did, how could you be sure that later versions of the WinXP SDK weren't hiding problems from you that existed in XP itself?)
The answer is to just LoadLibrary and GetProcAddress the functions out of their respective DLLs at runtime, which you can do from a program you compile just once, for one version of Windows.
Or, even more simply, just use Python and its ctypes module to access the functions in the DLLs. Just make sure you explicitly create and pass LPSTR and LPWSTR buffers, instead of passing str/bytes/unicode objects anywhere.
So ultimately, I think all you need is a 20-line Python script that uses ctypes to call MultiByteToWideChar out of KERNEL32.DLL, or mbstowcs out of MSVCRT.DLL, or both.
Upvotes: 1