Reputation: 483
I need to convert between wstring and string. I figured out that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.
My idea is that when I read a UTF-8 encoded file into chars, one UTF-8 character like the ones below is read into two ordinary chars (which is how UTF-8 works). I'd like to create such a UTF-8 string from the wstring representation, for a library I use in my code.
Does anybody know how to do it?
I already tried this:
locale mylocale("cs_CZ.utf-8");
mbstate_t mystate;
wstring mywstring = L"čřžýáí";
const codecvt<wchar_t,char,mbstate_t>& myfacet =
use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);
codecvt<wchar_t,char,mbstate_t>::result myresult;
size_t length = mywstring.length();
char* pstr= new char [length+1];
const wchar_t* pwc;
char* pc;
// translate characters:
myresult = myfacet.out (mystate,
mywstring.c_str(), mywstring.c_str()+length+1, pwc,
pstr, pstr+length+1, pc);
if ( myresult == codecvt<wchar_t,char,mbstate_t>::ok )
cout << "Translation successful: " << pstr << endl;
else cout << "failed" << endl;
return 0;
which prints 'failed' for the cs_CZ.utf-8 locale and works correctly for the cs_CZ.iso8859-2 locale.
Upvotes: 28
Views: 57059
Reputation: 70873
The currently most upvoted answer is not platform-independent. On platforms with a 16-bit wchar_t it breaks on non-BMP characters (e.g. emoji such as 🚒). JWiesemann already pointed this out in their answer, but their code will only work on Windows.
So here's a correct platform-independent version:
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>
std::string wstring_to_utf8(std::wstring const& str)
{
std::wstring_convert<std::conditional_t<
sizeof(wchar_t) == 4,
std::codecvt_utf8<wchar_t>,
std::codecvt_utf8_utf16<wchar_t>>> converter;
return converter.to_bytes(str);
}
std::wstring utf8_to_wstring(std::string const& str)
{
std::wstring_convert<std::conditional_t<
sizeof(wchar_t) == 4,
std::codecvt_utf8<wchar_t>,
std::codecvt_utf8_utf16<wchar_t>>> converter;
return converter.from_bytes(str);
}
On MSVC this might generate some deprecation warnings. You can disable these by wrapping the functions in
#pragma warning(push)
#pragma warning(disable : 4996)
<the two functions>
#pragma warning(pop)
See this answer to another question as to why it's ok to disable that warning.
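A quick way to convince yourself that the facet selection works is a round trip through a string containing a non-BMP character. The following is only a sketch: the alias utf8_facet and the function check_round_trip are names invented here for illustration, and the expected bytes are the UTF-8 encodings of č (U+010D) and 🚒 (U+1F692).

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Same facet selection as above, factored into an alias (hypothetical name).
using utf8_facet = std::conditional_t<
    sizeof(wchar_t) == 4,
    std::codecvt_utf8<wchar_t>,         // wchar_t holds UTF-32 code points
    std::codecvt_utf8_utf16<wchar_t>>;  // wchar_t holds UTF-16 code units

std::string wstring_to_utf8(std::wstring const& str)
{
    std::wstring_convert<utf8_facet> converter;
    return converter.to_bytes(str);
}

std::wstring utf8_to_wstring(std::string const& str)
{
    std::wstring_convert<utf8_facet> converter;
    return converter.from_bytes(str);
}

// Hypothetical self-check: one BMP character and one non-BMP character.
void check_round_trip()
{
    std::wstring const wide = L"\u010D\U0001F692";
    std::string const utf8 = wstring_to_utf8(wide);
    assert(utf8 == "\xC4\x8D\xF0\x9F\x9A\x92");  // 2 bytes + 4 bytes
    assert(utf8_to_wstring(utf8) == wide);
}
```

Calling check_round_trip() from main should pass silently with either wchar_t width, since the wide literal is a surrogate pair where wchar_t is 16 bits and a single code point where it is 32 bits.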
Upvotes: 1
Reputation: 31
On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, such as 😉 (U+1F609).
#include <codecvt>
#include <locale>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
return myconv.to_bytes(str);
}
Upvotes: 3
Reputation: 41
You can use Boost's utf_to_utf converter (from Boost.Locale) to get the char format to store in a std::string.
std::string myresult = boost::locale::conv::utf_to_utf<char>(my_wstring);
Upvotes: 2
Reputation: 2956
The code below might help you :)
#include <codecvt>
#include <locale>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
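One thing worth knowing when using these functions: std::wstring_convert reports a failed conversion by throwing std::range_error (unless it was constructed with error strings). Here is a small sketch of a non-throwing wrapper; the name try_utf8_to_wstring is made up for this example.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: returns true and fills `out` on success,
// returns false if `str` is not valid UTF-8.
bool try_utf8_to_wstring(const std::string& str, std::wstring& out)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    try {
        out = myconv.from_bytes(str);
        return true;
    } catch (const std::range_error&) {
        // from_bytes throws when the input cannot be converted
        return false;
    }
}
```

For instance, the byte 0xFF can never appear in UTF-8, so passing "\xFF" should make the wrapper return false instead of letting the exception escape.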
Upvotes: 98
Reputation: 66254
The Lexertl library has an iterator that lets you do this:
std::string str;
str.assign(
lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));
Upvotes: -2
Reputation: 2373
What's your platform? Note that Windows does not support UTF-8 locales, so this may explain why it's failing.
To get this done in a platform-dependent way, you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't say more about that option.
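If neither a platform API nor Boost is an option, the wstring-to-UTF-8 direction is small enough to write by hand. The following is a minimal sketch rather than a hardened implementation: the names append_utf8 and wstring_to_utf8_manual are invented for illustration, and it assumes wchar_t holds UTF-16 when it is 16 bits wide and UTF-32 otherwise. Lone surrogates and out-of-range values are replaced with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Precondition: cp is at most U+10FFFF.
void append_utf8(std::string& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

std::string wstring_to_utf8_manual(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // With a 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
            && i + 1 < in.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        // Anything that is still a surrogate, or out of range, is invalid.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this sketch, wstring_to_utf8_manual(L"č") yields the two bytes C4 8D, which matches the "one character, two chars" observation in the question.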
Upvotes: 9
Reputation: 49850
C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.
Upvotes: -10
Reputation: 36451
What the locale does is give the program information about the external encoding, while assuming that the internal encoding doesn't change. If you want to output UTF-8, you need to do it from wchar_t, not from char*.
What you could do is output it as raw data (not as a string); it should then be interpreted correctly if the system's locale is UTF-8.
Also, when using (w)cout/(w)cerr/(w)cin, you need to imbue the locale on the stream.
Upvotes: -1