bukzor

Reputation: 38462

Windows: directly examine cp1252

Let me preface this by saying: I am by no means a Windows programmer. Please help me by correcting any misunderstandings I may have.

My understanding is that Windows has both (legacy) single-byte string interfaces and modernized Unicode interfaces.

My goal is to closely examine cp1252 as implemented in the Windows kernel. I'll start with Windows XP, but I plan to check as many versions as I can.

I'm going to make the output of such a program similar in format to: https://encoding.spec.whatwg.org/index-windows-1252.txt

My question is primarily: what Windows API functions would I use to accomplish the above task? I think it's mbstowcs_s.

Secondarily: Must I write C in order to examine the relevant interfaces? If so what compiler would I use? I think Visual Studio Express 2010 is a good match, but I can't find any (legitimate) place to download it.


For those that must know the X to my Y, there are two competing standards and implementations of cp1252. They differ only slightly but they do differ, and it's significant to me.

The WHATWG specifies, and all browsers implement this standard: https://encoding.spec.whatwg.org/index-windows-1252.txt

Microsoft specifies, and python implements this standard: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

The difference is in five bytes: 0x81, 0x8D, 0x8F, 0x90, and 0x9D. In Microsoft's spec they're entirely undefined, so those bytes cannot be round-tripped through cp1252. In the WHATWG spec (and in all browsers) they map to non-printing control characters of the same value, as in latin1, meaning those bytes round-trip through cp1252 successfully.
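The difference is easy to demonstrate from Python, whose cp1252 codec follows Microsoft's published table (a minimal Python 3 sketch):

```python
# The five bytes Microsoft's table leaves undefined:
holes = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])

# Python implements Microsoft's table, so these bytes fail to decode:
try:
    holes.decode('cp1252')
    raise AssertionError('unexpectedly decoded')
except UnicodeDecodeError:
    pass

# Under latin1 semantics (which the WHATWG table adopts for these five
# bytes) each maps to the control character of the same ordinal, so the
# bytes round-trip:
assert holes.decode('latin1').encode('latin1') == holes
```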

I strongly suspect that Microsoft's implementation actually matches the WHATWG spec and browsers' implementations, rather than the spec they've published. This is what I'm trying to prove/disprove above.

Upvotes: 1

Views: 622

Answers (3)

abarnert

Reputation: 365697

Your question doesn't really make any sense. You want to examine "the encoding" used by each version of Windows from 95 through 10.

But none of those versions of Windows have "an encoding". Every single one of them is configurable in the same way: it has a default system encoding, which is pre-configured by Microsoft, and a current user encoding, which is set by Microsoft or the system OEM but which the user can change. So, your test won't depend on Windows 95 vs. Windows 7, it'll depend on US Windows 95 from Microsoft with default settings vs. ES Windows 95 from Microsoft with default settings vs. US Windows 95 from HP with default settings vs. US Windows 95 from Microsoft with each of the 238 possible choices in the Control Panel etc.

Also, to generate the kind of file you're trying to generate, you don't need to touch any Win32 APIs. All you need to do is call any function that uses the configured system locale's character set to decode single-byte/multi-byte text to UTF-16/Unicode text. For example, from C, you can call one of the mbstowcs family from the MSVCRT; from Python, you can call the decode method on a str (Python 2)/bytes (Python 3) object with locale.getpreferredencoding() (which reflects the configured ANSI code page on Windows); etc.
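As a concrete sketch (Python 3, using the stdlib cp1252 codec, which implements Microsoft's published table rather than querying the OS), emitting lines in the index-windows-1252.txt format:

```python
# Build index lines for the 0x80-0x9F range; to measure the OS itself
# you would instead route each byte through mbstowcs/MultiByteToWideChar.
index_lines = []
for b in range(0x80, 0xA0):
    try:
        cp = ord(bytes([b]).decode('cp1252'))
        index_lines.append('%3d\t0x%04X' % (b - 0x80, cp))
    except UnicodeDecodeError:
        index_lines.append('%3d\t(undefined)' % (b - 0x80))
print('\n'.join(index_lines))
```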

If you really want to use the system interfaces to test the same information, you can… but then you'll run into limitations of most of those interfaces. For example, you can CreateFileA to create a new file with an 8-bit name, then try to CreateFileW to open the same file with the corresponding 16-bit name and verify that it works… but then you can't test any of the illegal-for-filenames characters.

Finally, Microsoft has provided free C compilers for most if not all of those platforms, but some of them are long out of service, so I don't know if you can (at least legally) get them or not. But you can always use MinGW to set up a gcc-based toolchain. I don't know if the current versions still work on Win95, but if not, the old versions should still be available.

Upvotes: 1

bukzor

Reputation: 38462

Using @abarnert's help, I came up with this. In conclusion, Microsoft's spec doesn't match their implementation, as I suspected:

# Python 2, run on Windows (talks to the C runtime and kernel32 via ctypes)
from ctypes import cdll, windll, c_char_p, create_string_buffer
c = cdll.msvcrt
k = windll.kernel32
LC_ALL = 0  # from locale.h
# reference: https://msdn.microsoft.com/en-US/library/x99tb11d.aspx
c.setlocale.restype = c_char_p
result = c.setlocale(LC_ALL, '.1252')
assert result == 'English_United States.1252', result

# cp1252 is classified as "multi-byte" by the Windows API, along with utf8
mb = create_string_buffer(1)   # one input byte
wc1 = create_string_buffer(2)  # one UTF-16 code unit, via the CRT
wc2 = create_string_buffer(2)  # one UTF-16 code unit, via kernel32

print 'IN | MSVC  KERN'
print '---+-----------'
for b in range(0x80, 0xA0):
    mb.value = chr(b)

    # reference: https://msdn.microsoft.com/en-us/library/yk02bkxb.aspx
    result = c.mbtowc(wc1, mb, 1)
    assert result == 1, result

    # reference:
    #     https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072.aspx
    result = k.MultiByteToWideChar(1252, 0, mb, 1, wc2, 1)
    assert result == 1, result

    print '%02X | %02X%02X  %02X%02X' % (
        ord(mb.value),
        # little-endian:
        ord(wc1.raw[1]), ord(wc1.raw[0]),
        ord(wc2.raw[1]), ord(wc2.raw[0]),
    )

Output: (tested on Windows XP, Vista, 7, 8.1)

IN | MSVC  KERN
---+-----------
80 | 20AC  20AC
81 | 0081  0081
82 | 201A  201A
83 | 0192  0192
84 | 201E  201E
85 | 2026  2026
86 | 2020  2020
87 | 2021  2021
88 | 02C6  02C6
89 | 2030  2030
8A | 0160  0160
8B | 2039  2039
8C | 0152  0152
8D | 008D  008D
8E | 017D  017D
8F | 008F  008F
90 | 0090  0090
91 | 2018  2018
92 | 2019  2019
93 | 201C  201C
94 | 201D  201D
95 | 2022  2022
96 | 2013  2013
97 | 2014  2014
98 | 02DC  02DC
99 | 2122  2122
9A | 0161  0161
9B | 203A  203A
9C | 0153  0153
9D | 009D  009D
9E | 017E  017E
9F | 0178  0178

Compare this with the spec that Microsoft registered with unicode.org:

0x80    0x20AC  #EURO SIGN
0x81            #UNDEFINED
0x82    0x201A  #SINGLE LOW-9 QUOTATION MARK
0x83    0x0192  #LATIN SMALL LETTER F WITH HOOK
0x84    0x201E  #DOUBLE LOW-9 QUOTATION MARK
0x85    0x2026  #HORIZONTAL ELLIPSIS
0x86    0x2020  #DAGGER
0x87    0x2021  #DOUBLE DAGGER
0x88    0x02C6  #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89    0x2030  #PER MILLE SIGN
0x8A    0x0160  #LATIN CAPITAL LETTER S WITH CARON
0x8B    0x2039  #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C    0x0152  #LATIN CAPITAL LIGATURE OE
0x8D            #UNDEFINED
0x8E    0x017D  #LATIN CAPITAL LETTER Z WITH CARON
0x8F            #UNDEFINED
0x90            #UNDEFINED
0x91    0x2018  #LEFT SINGLE QUOTATION MARK
0x92    0x2019  #RIGHT SINGLE QUOTATION MARK
0x93    0x201C  #LEFT DOUBLE QUOTATION MARK
0x94    0x201D  #RIGHT DOUBLE QUOTATION MARK
0x95    0x2022  #BULLET
0x96    0x2013  #EN DASH
0x97    0x2014  #EM DASH
0x98    0x02DC  #SMALL TILDE
0x99    0x2122  #TRADE MARK SIGN
0x9A    0x0161  #LATIN SMALL LETTER S WITH CARON
0x9B    0x203A  #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C    0x0153  #LATIN SMALL LIGATURE OE
0x9D            #UNDEFINED
0x9E    0x017E  #LATIN SMALL LETTER Z WITH CARON
0x9F    0x0178  #LATIN CAPITAL LETTER Y WITH DIAERESIS

It's clear to me that the slots labeled UNDEFINED (bytes 81, 8D, 8F, 90, and 9D) are not undefined and are not errors: they decode to non-printing characters of equal ordinal, just as in the WHATWG spec, below:

  0 0x20AC  € (EURO SIGN)
  1 0x0081   (<control>)
  2 0x201A  ‚ (SINGLE LOW-9 QUOTATION MARK)
  3 0x0192  ƒ (LATIN SMALL LETTER F WITH HOOK)
  4 0x201E  „ (DOUBLE LOW-9 QUOTATION MARK)
  5 0x2026  … (HORIZONTAL ELLIPSIS)
  6 0x2020  † (DAGGER)
  7 0x2021  ‡ (DOUBLE DAGGER)
  8 0x02C6  ˆ (MODIFIER LETTER CIRCUMFLEX ACCENT)
  9 0x2030  ‰ (PER MILLE SIGN)
 10 0x0160  Š (LATIN CAPITAL LETTER S WITH CARON)
 11 0x2039  ‹ (SINGLE LEFT-POINTING ANGLE QUOTATION MARK)
 12 0x0152  Œ (LATIN CAPITAL LIGATURE OE)
 13 0x008D   (<control>)
 14 0x017D  Ž (LATIN CAPITAL LETTER Z WITH CARON)
 15 0x008F   (<control>)
 16 0x0090   (<control>)
 17 0x2018  ‘ (LEFT SINGLE QUOTATION MARK)
 18 0x2019  ’ (RIGHT SINGLE QUOTATION MARK)
 19 0x201C  “ (LEFT DOUBLE QUOTATION MARK)
 20 0x201D  ” (RIGHT DOUBLE QUOTATION MARK)
 21 0x2022  • (BULLET)
 22 0x2013  – (EN DASH)
 23 0x2014  — (EM DASH)
 24 0x02DC  ˜ (SMALL TILDE)
 25 0x2122  ™ (TRADE MARK SIGN)
 26 0x0161  š (LATIN SMALL LETTER S WITH CARON)
 27 0x203A  › (SINGLE RIGHT-POINTING ANGLE QUOTATION MARK)
 28 0x0153  œ (LATIN SMALL LIGATURE OE)
 29 0x009D   (<control>)
 30 0x017E  ž (LATIN SMALL LETTER Z WITH CARON)
 31 0x0178  Ÿ (LATIN CAPITAL LETTER Y WITH DIAERESIS)
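That round-trip claim is easy to sanity-check: patch the five holes in Python's stdlib cp1252 codec the way the WHATWG table does, and every byte value maps to a distinct code point. (A Python 3 sketch; decode_whatwg_1252 is a hypothetical helper, not a standard codec.)

```python
WHATWG_HOLES = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def decode_whatwg_1252(data):
    # Same as Microsoft's table, except the five undefined bytes decode
    # to control characters of the same ordinal (latin1-style).
    return ''.join(
        chr(b) if b in WHATWG_HOLES else bytes([b]).decode('cp1252')
        for b in data
    )

decoded = decode_whatwg_1252(bytes(range(256)))
# Injective mapping: all 256 byte values decode to 256 distinct code
# points, so encoding is well-defined and every byte round-trips.
assert len(set(decoded)) == 256
```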

Upvotes: 2

abarnert

Reputation: 365697

To answer your X question instead of your Y question:

You can't really ask how "Windows" handles what it calls "ANSI strings", because there are multiple different levels that handle them independently. It's a pretty good bet that they all do so in ways that are compatible… but your whole point is to avoid that pretty good bet and examine the truth directly.

I think you can safely assume that MultiByteToWideChar will give you the same results as calling SpamA vs. SpamW functions in the Win32 API. (If you can't even assume that, I think you'd really need to test every single function pair in the API to make sure they all have the same results…) You can pass 1252 directly, but I think passing CP_ACP (constant 0, the configured ANSI code page) on a system configured for 1252 is a better test of what you're asking. Or just do both.

It's plausible that MSVCRT (which handles providing an 8-bit-string-based standard C interface and large chunks of POSIX to portable programs, including CPython) has its own conversions. To verify that, call mbstowcs or one of its relatives.

I'm pretty sure the Win32 system layer handles ANSI strings the same way as the user layer, but you may want to look for a lower-level entry point like RtlMultiByteToUnicodeN in ntdll. And I think the kernel just doesn't handle ANSI strings anywhere—e.g., IIRC, when you write a filesystem driver, the only pathname interfaces are wide… but you may want to download the DDK and make sure I'm right about that.

I think the Explorer GUI shell relies on the Win32 layer to handle everything, and doesn't touch ANSI strings anywhere. The cmd.exe command-line shell only deals in Unicode (except when running DOS programs on Win9x)—but it's also a terminal, and as a terminal it does deal with both ANSI and Unicode strings and map between them. In particular, you can send either ANSI or Unicode console output and read either ANSI or Unicode console input. That's probably done via MultiByteToWideChar and friends, but I couldn't promise that. I think MSVCRT's stdin/stdout and their wide-character counterparts, and its DOS-conio-style getch/getwch families, just access these respective console APIs instead of translating in MSVCRT, but if you don't trust that, you can go around it and either get the native console streams or just call the Console I/O functions directly.

So, how do you write a test program for these things, without finding multiple out-of-support versions of Microsoft C++ compiler and an SDK for each OS? (And, even if you did, how could you be sure that later versions of the WinXP SDK weren't hiding problems from you that existed in XP itself?)

The answer is to just LoadLibrary and GetProcAddress the functions out of their respective DLLs at runtime. Which you can do from a program you just compile for one version of Windows.

Or, even more simply, just use Python, and use its ctypes module to access the functions out of the DLLs. Just make sure you explicitly create and pass LPSTR and LPWSTR buffers instead of passing str/bytes/unicode objects anywhere.
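For example, a minimal sketch of that buffer discipline (constructing the buffers works on any platform; the commented-out call needs Windows, and the code page number 1252 is assumed):

```python
from ctypes import create_string_buffer, create_unicode_buffer

mb = create_string_buffer(b'\x80', 1)  # one ANSI byte, no trailing NUL
wc = create_unicode_buffer(1)          # room for one UTF-16 code unit

# On Windows you would then call, e.g.:
#   from ctypes import windll
#   n = windll.kernel32.MultiByteToWideChar(1252, 0, mb, 1, wc, 1)
# and read the result out of wc.value / wc.raw.

assert mb.raw == b'\x80'
```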


So ultimately, I think all you need is a 20-line Python script that uses ctypes to call MultiByteToWideChar out of KERNEL32.DLL or mbstowcs out of MSVCRT.DLL, or both.

Upvotes: 1
