royco

Reputation: 5529

Create string from UTF-8 byte array?

Consider the emoji 😙. It's U+1F619 (decimal 128537). I believe its UTF-8 byte array is 240, 159, 152, 153.

  1. Given the UTF-8 byte array, how can I display it? Do I create a std::string from the byte array? Are there 3rd party libraries which help?
  2. Given a different emoji, how can I get its UTF-8 byte array?

Target platform: Windows. Compiler: Visual C++ 2019. Just pasting 😙 into the Windows CMD prompt does not work. I tried chcp 65001 and Lucida as the font, but no luck.

I can do this on macOS or Linux if necessary, but I prefer Windows.

To clarify ... given a list of 400 bytes, how can I display the corresponding code points assuming UTF-8?

Upvotes: 0

Views: 1462

Answers (2)

metablaster

Reputation: 2184

Here is sample code for experimenting with Unicode: it converts a Unicode character or string and prints it to the console. It works fine for many Unicode characters, provided you set the correct locale and console code page and perform the appropriate string conversion where needed (for example, char32_t, char16_t and char8_t need converting).

For the character you want to display, though, it's not that easy. Running the test takes a huge amount of time. This can be improved by modifying the code below or by knowing the details needed, such as the code page (which is likely not supported by Windows), so feel free to experiment as long as it doesn't become boring ;)

Hint: it would be best to add code that writes to a file, let it run, and check the results in the file after an hour or so. For this to work you'll need to put a BOM into the file, but not before the file is opened as UTF encoded; you do this by calling wofstream::imbue() with a specific locale. Which BOM to use depends on endianness: on Windows it's the UTF-X LE encoding scheme, where X is either 8, 16, or 32 (byte order only matters for 16 and 32; the UTF-8 BOM is always the same three bytes). The write to the file must be done with wchar_t streams, as with wcout, to be successful.
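
As a rough sketch of the write-to-file idea, the snippet below takes a different route from wofstream::imbue(): it assumes the text is already UTF-8 in a std::string and writes the raw bytes in binary mode, preceded by the three-byte UTF-8 BOM (EF BB BF). The function name write_utf8_with_bom is made up for illustration, and so is the file name in the note after the code.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from any library): write UTF-8 text to `path`,
// preceded by the UTF-8 byte order mark. Binary mode keeps the runtime
// from translating any bytes on Windows.
// Returns true if the stream is still in a good state after flushing.
bool write_utf8_with_bom(const std::string& path, const std::string& utf8_text)
{
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file)
    {
        return false;   // could not open the file
    }

    file.write("\xEF\xBB\xBF", 3);   // UTF-8 BOM
    file.write(utf8_text.data(), static_cast<std::streamsize>(utf8_text.size()));
    file.flush();

    return file.good();
}
```

Calling write_utf8_with_bom("out.txt", "\xF0\x9F\x98\x99") should produce a seven-byte file, which can be opened in an editor with UTF-8 support to check the result independently of the console.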

See the code comments for more info, and try commenting out or uncommenting parts of the code to see different and quicker results.

BTW, the point of this code is to try out all locales and code pages supported by the system until you see your smiley in the console, or ultimately fail.

#include <climits>
#include <locale>
#include <iostream>
#include <sstream>
#include <Windows.h>
#include <string_view>
#include <cassert>
#include <cuchar>   // std::c32rtomb
#include <cwchar>
#include <limits>
#include <vector>
#include <string>

#pragma warning (push, 4)
#if !defined UNICODE && !defined _UNICODE
#error "Compile as unicode"
#endif

#define LINE __LINE__
// NOTE: change desired default code page here (unused)
#define CODE_PAGE CP_UTF8


// Error handling helper method
void StringCastError()
{
    std::wstring error = L"Unknown error";

    switch (GetLastError())
    {
    case ERROR_INSUFFICIENT_BUFFER:
        error = L"A supplied buffer size was not large enough, or it was incorrectly set to NULL";
        break;
    case ERROR_INVALID_FLAGS:
        error = L"The values supplied for flags were not valid";
        break;
    case ERROR_INVALID_PARAMETER:
        error = L"Any of the parameter values was invalid.";
        break;
    case ERROR_NO_UNICODE_TRANSLATION:
        error = L"Invalid Unicode was found in a string.";
        break;
    default:
        break;
    }

    std::wcerr << error << std::endl;
}

// Convert multibyte to wide string
static std::wstring StringCast(const std::string& param, int code_page)
{
    if (param.empty())
    {
        std::wcerr << L"ERROR: param string is empty" << std::endl;
        return std::wstring();
    }

    DWORD flags = MB_ERR_INVALID_CHARS;
    //flags |= MB_USEGLYPHCHARS;
    //flags |= MB_PRECOMPOSED;

    switch (code_page)
    {
    case 50220:
    case 50221:
    case 50222:
    case 50225:
    case 50227:
    case 50229:
    case 65000:
    case 42:
        flags = 0;
        break;
    case 54936:
    case CP_UTF8:
        flags = MB_ERR_INVALID_CHARS; // or 0
        break;
    default:
        if ((code_page >= 57002) && (code_page <= 57011))
            flags = 0;
        break;
    }

    const int source_char_size = static_cast<int>(param.size());
    int chars = MultiByteToWideChar(code_page, flags, param.c_str(), source_char_size, nullptr, 0);

    if (chars == 0)
    {
        StringCastError();
        return std::wstring();
    }

    std::wstring return_string(static_cast<const unsigned int>(chars), 0);
    chars = MultiByteToWideChar(code_page, flags, param.c_str(), source_char_size, &return_string[0], chars);

    if (chars == 0)
    {
        StringCastError();
        return std::wstring();
    }

    return return_string;
}

// Convert wide to multibyte string
std::string StringCast(const std::wstring& param, int code_page)
{
    if (param.empty())
    {
        std::wcerr << L"ERROR: param string is empty" << std::endl;
        return std::string();
    }

    DWORD flags = WC_ERR_INVALID_CHARS;
    //flags |= WC_COMPOSITECHECK;
    flags |= WC_NO_BEST_FIT_CHARS;

    switch (code_page)
    {
    case 50220:
    case 50221:
    case 50222:
    case 50225:
    case 50227:
    case 50229:
    case 65000:
    case 42:
        flags = 0;
        break;
    case 54936:
    case CP_UTF8:
        flags = WC_ERR_INVALID_CHARS; // or 0
        break;
    default:
        if ((code_page >= 57002) && (code_page <= 57011))
            flags = 0;
        break;
    }

    const int source_wchar_size = static_cast<int>(param.size());
    int chars = WideCharToMultiByte(code_page, flags, param.c_str(), source_wchar_size, nullptr, 0, nullptr, nullptr);

    if (chars == 0)
    {
        StringCastError();
        return std::string();
    }

    std::string return_string(static_cast<const unsigned int>(chars), 0);

    chars = WideCharToMultiByte(code_page, flags, param.c_str(), source_wchar_size, &return_string[0], chars, nullptr, nullptr);

    if (chars == 0)
    {
        StringCastError();
        return std::string();
    }

    return return_string;
}

// Console code page helper to adjust console
bool SetConsole(UINT code_page)
{
    if (IsValidCodePage(code_page) == 0)
    {
        std::wcerr << L"Code page is not valid: " << LINE << std::endl;
    }
    else if (SetConsoleCP(code_page) == 0)
    {
        std::wcerr << L"Failed to set console input code page line: " << LINE << std::endl;
    }
    else if (SetConsoleOutputCP(code_page) == 0)
    {
        std::wcerr << L"Failed to set console output code page: " << LINE << std::endl;
    }
    else
    {
        return true;
    }

    return false;
}

std::vector<std::string> locales;

// System locale enumerator to get all locales installed on system
BOOL CALLBACK LocaleEnumprocex(LPWSTR locale_name, [[maybe_unused]] DWORD locale_info, LPARAM code_page)
{
    locales.push_back(StringCast(locale_name, static_cast<int>(code_page)));
    return TRUE;    // continue drilling
}

// System code page enumerator to try out every possible supported/installed code page on system
BOOL CALLBACK EnumCodePagesProc(LPTSTR page_str)
{
    wchar_t* end;
    UINT code_page = std::wcstol(page_str, &end, 10);

    char char_buff[MB_LEN_MAX]{};
    char32_t target_char = U'😙';

    std::mbstate_t state{};
    std::stringstream string_buff{};
    std::wstring wstr = L"";

    // convert UTF-32 to multibyte
    std::size_t ret = std::c32rtomb(char_buff, target_char, &state);

    if (ret == static_cast<std::size_t>(-1))
    {
        std::wcout << L"Conversion from char32_t failed: " << LINE << std::endl;
        return FALSE;
    }
    else
    {
        string_buff << std::string_view{ char_buff, ret };
        string_buff << '\0';

        if (string_buff.fail())
        {
            string_buff.clear();
            std::wcout << L"string_buff failed or bad line: " << LINE << std::endl;
            return FALSE;
        }

        // NOTE: CP_UTF8 gives good results; e.g. CP_SYMBOL or the code_page variable do not.
        // To make this work, provide a suitable code page.
        wstr = StringCast(string_buff.str(), CP_UTF8 /* code_page */ /* CP_SYMBOL */);
    }

    // Try out every possible locale; this will take an insane amount of time!
    // Make sure to comment this range-for loop out if you know the locale.
    for (const auto& loc : locales)
    {
        // locale used (comment out for testing)
        std::locale::global(std::locale(loc));

        if (SetConsole(code_page))
        {
            // HACK: put breakpoint here, and you'll see the string
            // is correctly encoded inside wstr (ex. mouse over wstr)
            // However it's not printed because console code page is likely wrong.
            assert(std::wcout.good() && string_buff.good());
            std::wcout << wstr << std::endl;

            // NOTE: commented out to avoid spamming the console, basically
            // hard to find correct code page if not impossible for CMD
            if (std::wcout.bad())
            {
                std::wcout.clear();
                //std::wcout << L"std::wcout Read/write error on i/o operation line:  " << LINE << std::endl;
            }
            else if (std::wcout.fail())
            {
                std::wcout.clear();
                //std::wcout << L"std::wcout Logical error on i/o operation line:  " << LINE << std::endl;
            }
        }
    }

    return TRUE;    // continue drilling
}

int main()
{
    // NOTE: can be also LOCALE_ALL, anything else than CP_UTF8 doesn't make sense here
    EnumSystemLocalesEx(LocaleEnumprocex, LOCALE_WINDOWS, static_cast<LPARAM>(CP_UTF8), 0);

    // NOTE: can also be CP_INSTALLED
    EnumSystemCodePagesW(EnumCodePagesProc, CP_SUPPORTED);

    // NOTE: the following is just test code to demonstrate that these algorithms indeed work;
    // comment out the two function calls above to test!
    std::mbstate_t state{};
    std::stringstream string_buff{};

    char char_buff[MB_LEN_MAX]{};

    // Test case for working char:
    std::locale::global(std::locale("ru_RU.utf8"));

    string_buff.clear();
    string_buff.str(std::string());

    // Russian (KOI8-R); Cyrillic (KOI8-R)
    if (SetConsole(20866))
    {
        char32_t char32_str[] = U"Познер обнародовал";

        for (char32_t c32 : char32_str)
        {
            std::size_t ret2 = std::c32rtomb(char_buff, c32, &state);

            if (ret2 == static_cast<std::size_t>(-1))
            {
                std::wcout << L"Conversion from char32_t failed line: " << LINE << std::endl;
            }
            else
            {
                string_buff << std::string_view{ char_buff, ret2 };
            }
        }

        string_buff << '\0';
        if (string_buff.fail())
        {
            string_buff.clear();
            std::wcout << L"string_buff failed or bad line:  " << LINE << std::endl;
        }

        std::wstring wstr = StringCast(string_buff.str(), CP_UTF8);
        std::wcout << wstr << std::endl;

        if (std::wcout.fail())
        {
            std::wcout.clear();
            std::wcout << L"std::wcout failed or bad line:  " << LINE << std::endl;
        }
    }
}

#pragma warning (pop)
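
For the clarified question (a list of bytes whose code points should be shown), a decoder can be sketched without any Windows API. The function below is a minimal, hypothetical helper (decode_utf8 is not a library function). It assumes mostly well-formed input, replaces obviously broken sequences with U+FFFD, and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte sequence into Unicode code points.
// Invalid lead bytes and truncated or broken sequences become U+FFFD.
// Simplified on purpose: overlong forms and surrogates are not rejected.
std::vector<std::uint32_t> decode_utf8(const std::vector<unsigned char>& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;

    while (i < bytes.size())
    {
        const unsigned char lead = bytes[i];
        std::uint32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else
        {
            // Stray continuation byte or invalid lead byte
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            // Every continuation byte must look like 10xxxxxx
            if (i >= bytes.size() || (bytes[i] & 0xC0) != 0x80)
            {
                ok = false;
                break;
            }
            cp = (cp << 6) | (bytes[i] & 0x3F);
            ++i;
        }

        out.push_back(ok ? cp : 0xFFFD);
    }

    return out;
}
```

With the bytes 240, 159, 152, 153 this returns the single code point 0x1F619. Printing the returned values in hex shows what a byte list contains even when the console cannot render the glyphs.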

Upvotes: 1

user9475097

C++ has a simple solution to that.

#include <iostream>
#include <string>

int main(void) {
    std::string s = u8"😙"; /* use std::u8string in c++20*/
    std::cout << s << std::endl;
    return 0;
}

This lets you store any UTF-8 string and write it to standard output.

Note that the Windows command prompt is unreliable with this kind of output. You are better off using an alternative such as MSYS2.
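
Both directions asked about in the question can also be sketched in portable C++. The helpers below use hypothetical names and are not standard library functions: from_bytes copies raw byte values into a std::string, and encode_utf8 produces the UTF-8 bytes for a single code point.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: build a std::string from raw UTF-8 byte values.
std::string from_bytes(const std::vector<unsigned char>& bytes)
{
    return std::string(bytes.begin(), bytes.end());
}

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                           // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));             // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));           // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            return out;                                         // surrogate: not encodable
        }
        out += static_cast<char>(0xE0 | (cp >> 12));            // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));            // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }

    return out;
}
```

encode_utf8(0x1F619) yields the bytes 240, 159, 152, 153, and from_bytes({240, 159, 152, 153}) gives the same string, which can then be written with std::cout in the same way as the literal above.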

Upvotes: 1
