Jonathan

Reputation: 7551

Capture spawned process stdout as unicode

In my C++/WinAPI code, I want to run some commands and capture their output. To test non-ASCII output, I renamed my network connection to Ethérnét אבג БбГгДд and ran ipconfig. When run in a command prompt, the output comes out correctly (visible when using a supporting font like Courier New):

C:\>ipconfig
Windows IP Configuration

Ethernet adapter Ethérnét אבג БбГгДд:
(...)

I tried to redirect the output to a pipe, following the example in this answer. But the byte array returned from ReadFile() is not unicode - it's encoded in CP_OEMCP (CP437 in my case), and so the Hebrew and Russian characters come out as '?'s. Since the characters are already lost, no further handling can restore them.

Obviously it's possible, since cmd in a console window does it. How can I do it?

Upvotes: 6

Views: 1381

Answers (2)

Harry Johnston

Reputation: 36308

It would seem that ipconfig produces Unicode output when it detects that the output device is the console, and ANSI output otherwise. This is likely to be a backwards-compatibility measure.

Most other built-in command-line tools are likely to either be ANSI-only or to behave in the same way as ipconfig, for the same reason. In Windows, command-line tools are meant, well, for use on the command line; programmers are discouraged from shelling out to them and parsing the output. Instead, you should use the corresponding APIs.

If you know which language you are expecting, you might be able to choose a code page that will preserve the content.

Added by @Jonathan: Undocumented: Turns out you can control the encoding of built-in commands using the environment variable OutputEncoding. I tested with ipconfig, but presumably it works with other built-in tools as well:

> for %e in ("" Unicode Ansi UTF8) do (set OutputEncoding=%~e& ipconfig >ipconfig-%~e.txt)
> (set OutputEncoding=  & ipconfig  1>ipconfig-.txt )
> (set OutputEncoding=Unicode  & ipconfig  1>ipconfig-Unicode.txt )
> (set OutputEncoding=Ansi  & ipconfig  1>ipconfig-Ansi.txt )
> (set OutputEncoding=UTF8  & ipconfig  1>ipconfig-UTF8.txt )

And indeed, the ipconfig-*.txt files are encoded as expected! Note that this is undocumented, but it does work for me.
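A minimal sketch of decoding the captured bytes when OutputEncoding=Unicode is set, assuming that "Unicode" here means UTF-16LE, possibly with a BOM. That is the usual Windows meaning, but it is not confirmed above. Utf16LeToUtf8 and AppendUtf8 are hypothetical helper names; on Windows, WideCharToMultiByte(CP_UTF8, ...) does the same job on the wide string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to a UTF-8 string.
static void AppendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode UTF-16LE bytes (optional BOM) to UTF-8.
// Unpaired surrogates become U+FFFD; a trailing odd byte is ignored.
std::string Utf16LeToUtf8(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    std::string out;
    std::size_t i = 0;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        i = 2; // skip the byte order mark
    while (i + 1 < size) {
        char32_t u = static_cast<char32_t>(data[i] | (data[i + 1] << 8));
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < size) {
            // High surrogate: try to combine it with the following low surrogate.
            const char32_t lo = static_cast<char32_t>(data[i] | (data[i + 1] << 8));
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                u = 0xFFFD;
            }
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD; // lone surrogate
        }
        AppendUtf8(out, u);
    }
    return out;
}
```

Because a pipe read can end in the middle of a 2-byte code unit, it is safer to accumulate the whole output (or carry the odd byte over) before calling the helper.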

Addendum: as of Windows 10 v1809, another alternative is to create a pseudoconsole.

Upvotes: 5

RbMm

Reputation: 33706

A console application can produce its output in several ways:

  • For a console handle, we can call WriteConsoleW to output text that is already Unicode.
  • To use WriteConsoleA or WriteFile on a console handle, we first need to convert the Unicode text to multi-byte with WideCharToMultiByte, using GetConsoleOutputCP() as the code page.
  • If the text is not Unicode to begin with (say UTF-8 or ANSI), we first need to convert it to Unicode with MultiByteToWideChar (using CP_UTF8 or CP_ACP) and then convert it back to multi-byte with WideCharToMultiByte(GetConsoleOutputCP(), ..).

By default, GetConsoleOutputCP() usually returns the same value as GetOEMCP(), so passing it to MultiByteToWideChar and WideCharToMultiByte has the same effect as passing CP_OEMCP (that constant is translated to GetOEMCP()).

When the output handle is redirected to a file, the application has to use WriteFile. However, it can write the data in any format: Unicode, ANSI (CP_ACP), UTF-8 (CP_UTF8), etc. Which format is used depends heavily on the specific application, and you cannot fully control it. Usually you will receive multi-byte output in CP_OEMCP encoding. You then need to decide how to process it: most likely you will first convert it to Unicode and work with the Unicode form. If you need ANSI, you will need one more conversion.

For example, if you pass pipe output in CP_OEMCP encoding straight to OutputDebugStringA, non-English text comes out unreadable. After two conversions, CP_OEMCP -> Unicode -> CP_ACP, OutputDebugStringA displays the text correctly. But because OutputDebugStringW exists, converting to Unicode alone is enough here.

Some applications also have special options to control the format of output written to a file. For example, ipconfig.exe looks for the "OutputEncoding" environment variable and produces different output depending on its string value ("Unicode", "Ansi", "UTF-8"). By default (if the variable does not exist or has an unknown value), CP_OEMCP is used.

Example of a pipe read procedure, assuming the input data is in CP_OEMCP encoding:

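// Forward declaration: DoPrint is defined after OnRead, but must be visible here.
template<typename T> void DoPrint(T* p, ULONG len, void (WINAPI* fnOutput)(const T*));

// g_bUseAnsi is assumed to be a global flag defined elsewhere.
void OnRead(PVOID buf, ULONG cbTransferred)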
{
    if (cbTransferred)
    {
        if (int len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, 0, 0))
        {
            PWSTR pwz = (PWSTR)alloca((1 + len) * sizeof(WCHAR));

            if (len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, pwz, len))
            {
                if (g_bUseAnsi)
                {
                    if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, 0, 0, 0, 0))
                    {
                        PSTR psz = (PSTR)alloca(cbTransferred + 1);

                        if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, psz, cbTransferred, 0, 0))
                        {
                            DoPrint(psz, cbTransferred, OutputDebugStringA);
                        }
                    }
                }
                else
                {
                    DoPrint(pwz, len, OutputDebugStringW);
                }
            }
        }
    }
}

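// The debugger may print an overly large buffer incompletely, so split it into small chunks.
// Each chunk is temporarily null-terminated in place, so the buffer needs one spare element at the end.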
template<typename T> void DoPrint(T* p, ULONG len, void (WINAPI* fnOutput)(const T*))
{
    ULONG cb;
    T* q = p;
    do 
    {
        cb = min(len, 256);

        q = p + cb;

        T c = *q;

        *q = 0;

        fnOutput(p);

        *q = c;

        p = q;

    } while (len -= cb);
}
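The chunking idea above can also be sketched without temporarily patching the buffer, at the cost of one copy per chunk. This is only an illustration: SplitIntoChunks is a hypothetical name, and the 256-character default simply mirrors the limit used in DoPrint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy text into pieces of at most maxChunk characters.
// Works for char (OutputDebugStringA) and wchar_t (OutputDebugStringW) alike.
template <typename Ch>
std::vector<std::basic_string<Ch>> SplitIntoChunks(const Ch* p, std::size_t len,
                                                   std::size_t maxChunk = 256)
{
    assert(maxChunk > 0);
    std::vector<std::basic_string<Ch>> chunks;
    while (len > 0) {
        const std::size_t cb = std::min(len, maxChunk);
        chunks.emplace_back(p, cb); // each copy is null-terminated by std::basic_string
        p += cb;
        len -= cb;
    }
    return chunks;
}
```

Each element's c_str() can then be passed to OutputDebugStringA or OutputDebugStringW. Like DoPrint, this splits on a fixed count, so a chunk boundary can still fall inside a multi-byte character or a surrogate pair.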

As for your specific case: ipconfig.exe uses WriteConsoleW for console output, so it does not depend on the current system locale and displays multilingual text correctly. Other tools, such as route.exe, use WriteFile for output (to both console and file) and first convert the Unicode text to multi-byte with WideCharToMultiByte(CP_OEMCP,..). That causes problems for characters that do not exist in the CP_OEMCP code page (the current system locale). With CP437, Hebrew and Russian characters are lost in the Unicode -> CP_OEMCP conversion, so the only fix is to output Unicode directly, to both console and file. Whether that is possible depends on the specific application. For route.exe it is not. For ipconfig.exe it is, because it always writes to the console in Unicode and can also write to a file in Unicode or UTF-8 if you set "OutputEncoding" to "Unicode" or "UTF-8".
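One practical detail if the child is switched to UTF-8 output: a pipe read may end in the middle of a multi-byte character, so converting each buffer on its own can corrupt the character at the boundary. A minimal sketch of one way to handle that, assuming the data really is UTF-8. CompleteUtf8Prefix is a hypothetical helper name, not part of any API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns how many leading bytes of buf consist only of
// complete UTF-8 sequences. The remaining 0..3 bytes are an unfinished
// character and should be prepended to the data from the next read.
// Malformed input is returned whole so the decoder can deal with it.
std::size_t CompleteUtf8Prefix(const unsigned char* buf, std::size_t size)
{
    assert(buf != nullptr || size == 0);
    std::size_t i = size;
    std::size_t back = 0;
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0)
        return size; // no lead byte found

    const unsigned char lead = buf[i - 1];
    std::size_t need;
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return size; // not a valid lead byte

    const std::size_t have = back + 1; // lead byte plus continuation bytes seen
    return have < need ? i - 1 : size;
}
```

In a read loop, only the first CompleteUtf8Prefix(buf, n) bytes would be passed to MultiByteToWideChar(CP_UTF8, ...), and the leftover bytes moved to the front of the buffer before the next ReadFile.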

Upvotes: -1
