adam

Reputation: 23

How to convert text from CP437 encoding to UTF-8 encoding?

On Windows, the value of the Unicode character ö (Latin small letter o with diaeresis) in the CP437 character set is 148 (0x94).

On Linux, ö is encoded in UTF-8 as two bytes, shown here as signed char values:

-61 (0xC3, first byte)
-74 (0xB6, second byte)
(read as a little-endian 16-bit unsigned value: 46787 = 0xB6C3)

My question is: how can I convert the value 148 from CP437 to UTF-8 in C++ on Linux?

My problem is described in more detail here:

open() function in Linux with extended characters (128-255) returns -1 error

Temporary solution: C++11 supports the conversion to UTF-8 using codecvt_utf8

Upvotes: 0

Views: 9622

Answers (4)

rhapsodyv

Reputation: 71

I wrote working code based on @remy-lebeau's answer. It uses the Win32 API, so it is Windows-only. I hope it helps.

#include <windows.h>
#include <string>

// Converts inlen bytes encoded in codePage (e.g. 437) to a UTF-8 string.
std::string cp2UTF8(int codePage, const char* in, int inlen) {
    // first convert the input from codePage to UTF-16 (wide char)
    int widelen = MultiByteToWideChar(codePage, 0, in, inlen, NULL, 0);
    if (widelen <= 0) return std::string();  // empty input or conversion error
    std::wstring wide(widelen, L'\0');
    MultiByteToWideChar(codePage, 0, in, inlen, &wide[0], widelen);
    // then convert UTF-16 to UTF-8
    int utf8len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), widelen, NULL, 0, NULL, NULL);
    if (utf8len <= 0) return std::string();  // conversion error
    std::string utf8(utf8len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), widelen, &utf8[0], utf8len, NULL, NULL);
    return utf8;
}

Upvotes: 0

Ralph Bisschops

Reputation: 2518

This is not C++, but you can also convert a whole file from the shell with the iconv command-line tool:

$ iconv -f CP437 -t UTF-8 input_file_name.txt -o output_file_name.txt

Upvotes: 2

adam

Reputation: 23

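I found this solution for converting CP437 to UTF-8. It works perfectly on Linux:

        // Assumes sCMResult.wChar already holds the Unicode code point of the
        // character; only the two-byte UTF-8 range 0x80..0x7FF is handled.
        BYTE high, low;
        WORD result;
        if (sCMResult.wChar >= 0x80 && sCMResult.wChar <= 0x7ff)
        {
            low = (0xc0 | ((sCMResult.wChar >> 6) & 0x1f));  // UTF-8 lead byte: 110xxxxx
            high = (0x80 | (sCMResult.wChar & 0x3f));        // UTF-8 trail byte: 10xxxxxx
            result = low | (high << 8);  // lead byte in the low 8 bits of the WORD
        }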

Full post can be found here
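
For code points outside the two-byte range, the same bit-packing idea extends to one, three, and four bytes. Here is a minimal sketch, assuming the input is already a Unicode code point and not a raw CP437 byte. The name encode_utf8 is made up for illustration, and the function does not reject surrogates or values above 0x10FFFF:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// No validation is done for surrogates or values above 0x10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, encode_utf8(0xF6) yields the bytes 0xC3 0xB6 for ö. The CP437 byte 148 still has to be mapped to the code point U+00F6 first.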

Upvotes: -1

Remy Lebeau

Reputation: 595827

On Windows, you can use the Win32 MultiByteToWideChar() function to convert data from CP437 to UTF-16, and then use the WideCharToMultiByte() function to convert data from UTF-16 to UTF-8.

On Linux, you can use a Unicode conversion library, like libiconv or ICU (which are available for Windows, too).
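
As a rough illustration of the iconv route, here is a minimal sketch using the POSIX iconv_open()/iconv()/iconv_close() calls. The function name cp437_to_utf8_iconv is made up for this example. Whether the charset name "CP437" is recognized depends on the iconv implementation installed, and on some platforms you need to link with -liconv:

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a CP437 byte string to UTF-8 via iconv(3).
std::string cp437_to_utf8_iconv(const std::string& in) {
    // iconv_open takes (to-encoding, from-encoding)
    iconv_t cd = iconv_open("UTF-8", "CP437");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed (is CP437 supported?)");

    // Every CP437 character is in the BMP, so it needs at most 3 UTF-8 bytes.
    std::string out(in.size() * 3, '\0');
    char* src = const_cast<char*>(in.data());  // glibc's iconv takes char**
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();

    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dstLeft);  // trim the unused tail of the buffer
    return out;
}
```

Calling cp437_to_utf8_iconv("\x94") should give the two bytes 0xC3 0xB6 (ö), provided the CP437 converter is available.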


In C++11 and later, you can use std::wstring_convert to:

  • convert from CP437 to either UTF-16 or UTF-32/UCS-4 (if you can get/make a codecvt for CP437, that is).

  • then, convert from UTF-16 or UTF-32/UCS-4 to UTF-8.

You can't use codecvt_utf8 to convert from CP437 to UTF-8 directly. It only supports conversions between:

  • UTF-8 and UCS-2 (not UTF-16!)

  • UTF-8 and UTF-32/UCS-4.

You have to use codecvt_utf8_utf16 for conversions between UTF-8 and UTF-16.
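
The second step (UTF-32 to UTF-8) can be sketched as follows. This is only the half that the standard facets cover, and it assumes you have already obtained the UTF-32 code points from CP437 by some other means. The function name utf32_to_utf8 is made up for illustration. Note that std::wstring_convert and <codecvt> have been deprecated since C++17, so compilers may warn:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-32 to UTF-8 using std::wstring_convert.
// to_bytes() throws std::range_error if the input cannot be converted.
std::string utf32_to_utf8(const std::u32string& in) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}
```

For example, utf32_to_utf8(U"\u00F6") produces the bytes 0xC3 0xB6.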

Or, you can use mbrtoc16() to convert CP437 to UTF-16 using a CP437 locale, and then use c16rtomb() to convert UTF-16 to UTF-8 using a UTF-8 locale (if your standard library implements the fix for DR488; otherwise c16rtomb() only supports UCS-2 and not UTF-16!).


Otherwise, just create your own CP437-to-UTF-8 lookup table for the 256 possible CP437 bytes, and then do the conversion manually, one byte at a time.
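
A minimal sketch of the lookup-table approach follows. The names kCp437High and cp437_to_utf8 are made up for this example. Bytes 0x00..0x7F are passed through as ASCII (the graphical glyphs CP437 displays for control codes are ignored), and the table for 0x80..0xFF is typed in from the commonly published CP437-to-Unicode mapping, so please check it against an authoritative table before relying on it:

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP437 bytes 0x80..0xFF (index = byte - 0x80).
static const std::uint16_t kCp437High[128] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0
};

// Hypothetical helper: convert a CP437 byte string to UTF-8, one byte at a time.
std::string cp437_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // each CP437 byte becomes at most 3 UTF-8 bytes
    for (unsigned char b : in) {
        // ASCII range maps to itself; the upper half goes through the table.
        std::uint16_t cp = (b < 0x80) ? b : kCp437High[b - 0x80];
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For the byte from the question, cp437_to_utf8("\x94") returns the two bytes 0xC3 0xB6. The table costs 256 bytes of static data and needs no locale or external library, which is the main reason to prefer it when only CP437 has to be supported.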

Upvotes: 6
