How to convert from UTF-8 to ANSI using standard C++

I have some strings read from the database, stored in a char* and in UTF-8 format (you know, "á" is encoded as 0xC3 0xA1). In order to write them to a file, I first need to convert them to ANSI (I can't make the file UTF-8... it's only ever read as ANSI), so that my "á" doesn't become "á". Yes, I know some data will be lost (Chinese characters, and in general anything not in the ANSI code page), but that's exactly what I need.

The thing is, I need the code to compile on various platforms, so it has to be standard C++ (i.e. no WinAPI; only the stdlib, STL, CRT, or any custom library with available source).

Does anyone have any suggestions?

Upvotes: 13

Views: 35664

Answers (5)

vikas pachisia

Reputation: 615

codecvt-based conversions were deprecated in C++17; they remained deprecated through C++20 and C++23 and have been removed entirely in C++26.

No standard alternative has been proposed yet, but the codecvt specialized converters, including wstring_convert and wbuffer_convert, are easy to misuse, which is why they were deprecated. It is now up to each application to decide between these facilities and native APIs for conversions.

The current recommendation is to use native APIs; for example, WideCharToMultiByte and MultiByteToWideChar on Windows.
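For illustration, a minimal Windows-only sketch of that route (the helper name is mine): decode UTF-8 to UTF-16 with MultiByteToWideChar, then encode to the active ANSI code page with WideCharToMultiByte:

#include <windows.h>
#include <string>

std::string utf8_to_ansi(const std::string& utf8)
{
    // First hop: UTF-8 -> UTF-16; length -1 includes the terminating NUL
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], wlen);

    // Second hop: UTF-16 -> system ANSI code page (CP_ACP),
    // replacing anything unmappable with '?'
    int alen = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1, NULL, 0, "?", NULL);
    std::string ansi(alen, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1, &ansi[0], alen, "?", NULL);
    ansi.resize(alen - 1); // drop the trailing NUL written by the API
    return ansi;
}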

Upvotes: 1

Mehdi Mohammadpour

Reputation: 1130

#include <string>
#include <codecvt>
#include <locale>
#include <vector>

using namespace std;

std::string utf8_to_string(const char *utf8str, const locale& loc){
    // UTF-8 to wstring
    wstring_convert<codecvt_utf8<wchar_t>> wconv;
    wstring wstr = wconv.from_bytes(utf8str);
    // wstring to string, replacing unmappable characters with '?'
    vector<char> buf(wstr.size());
    use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
    return string(buf.data(), buf.size());
}

int main(int argc, char* argv[]){
    std::string ansi;
    // "á" as UTF-8 bytes; the casts avoid a narrowing error in braced init
    char utf8txt[] = {char(0xc3), char(0xa1), 0};

    // I guess you want to use Windows-1252 encoding...
    ansi = utf8_to_string(utf8txt, locale(".1252"));
    // Now do something with the string
    return 0;
}

Upvotes: 0

cdycdr

Reputation: 1

This should work:

#include <string>
#include <codecvt>

using namespace std::string_literals;

std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
  // widen each char to a char32_t via the locale, then encode as UTF-8
  std::u32string wstr(str.size(), U'\0');
  std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
  return wcvt{}.to_bytes(wstr.data(), wstr.data() + wstr.size());
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
  // decode UTF-8, then narrow via the locale; '?' replaces unmappable characters
  auto wstr = wcvt{}.from_bytes(str);
  std::string result(wstr.size(), '\0');
  std::use_facet<std::ctype<char32_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', &result[0]);
  return result;
}

int main() {
  auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
  auto s1 = from_utf8(s0);
  auto s2 = to_utf8(s1);

  return 0;
}

For VC++, whose runtime library is missing the exported symbols for the std::codecvt<char32_t, char, mbstate_t> specialization (a known linker bug), substitute int32_t for char32_t:

#include <string>
#include <codecvt>

using namespace std::string_literals;

std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
  // widen each char via the locale, then encode the code points as UTF-8
  std::u32string wstr(str.size(), U'\0');
  std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
  return wcvt{}.to_bytes(
    reinterpret_cast<const int32_t*>(wstr.data()),
    reinterpret_cast<const int32_t*>(wstr.data() + wstr.size())
  );
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
  // decode UTF-8, then narrow via the locale; '?' replaces unmappable characters
  auto wstr = wcvt{}.from_bytes(str);
  std::string result(wstr.size(), '\0');
  std::use_facet<std::ctype<char32_t>>(loc).narrow(
    reinterpret_cast<const char32_t*>(wstr.data()),
    reinterpret_cast<const char32_t*>(wstr.data() + wstr.size()),
    '?', &result[0]);
  return result;
}

int main() {
  auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
  auto s1 = from_utf8(s0);
  auto s2 = to_utf8(s1);

  return 0;
}

Upvotes: 0

A few days ago, somebody answered that if I had a C++11 compiler, I could try this:

#include <string>
#include <codecvt>
#include <locale>
#include <vector>

using namespace std;

string utf8_to_string(const char *utf8str, const locale& loc)
{
    // UTF-8 to wstring
    wstring_convert<codecvt_utf8<wchar_t>> wconv;
    wstring wstr = wconv.from_bytes(utf8str);
    // wstring to string, replacing unmappable characters with '?'
    vector<char> buf(wstr.size());
    use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
    return string(buf.data(), buf.size());
}

int main(int argc, char* argv[])
{
    string ansi;
    // "á" as UTF-8 bytes; the casts avoid a narrowing error in braced init
    char utf8txt[] = {char(0xc3), char(0xa1), 0};

    // I guess you want to use Windows-1252 encoding...
    ansi = utf8_to_string(utf8txt, locale(".1252"));
    // Now do something with the string
    return 0;
}

I don't know what happened to that response; apparently someone deleted it. But it turns out to be the perfect solution. To whoever posted it: thanks a lot, and you deserve the accepted answer and an upvote!!

Upvotes: 15

Ulrich Eckhardt

Reputation: 17444

If you mean ASCII, just discard any byte that has bit 7 set; this removes all multibyte sequences. Note that you could create more advanced algorithms, like removing the accent from the "á", but that would require much more work.
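For illustration, a minimal portable sketch of that byte-discarding approach (the helper name is mine):

#include <string>

// Keep only ASCII bytes (0x00-0x7F); every byte with bit 7 set is part
// of a UTF-8 multibyte sequence and gets dropped.
std::string strip_non_ascii(const std::string& utf8)
{
    std::string ascii;
    ascii.reserve(utf8.size());
    for (unsigned char c : utf8)
        if (c < 0x80)
            ascii += static_cast<char>(c);
    return ascii;
}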

Upvotes: 1
