We are representing paths as boost::filesystem::path, but in some cases other APIs are expecting them as const char * (e.g., to open a DB file with SQLite).
From the documentation, path::value_type is wchar_t under Windows. As far as I know, wchar_t on Windows is 2 bytes wide and holds UTF-16 code units.
There is a string() native observer that returns a std::string, while stating:
"If string_type is a different type than String, conversion is performed by cvt."
cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?
There is this forum entry that recommends using an instance of utf8_codecvt_facet as the cvt value to portably convert to UTF-8. But it seems that this codecvt actually converts between UTF-8 and UCS-4, not UTF-16.
What would be the best way (and if possible portable) to obtain an std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?
Reputation: 27756
cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?
It uses the default locale for conversion to the locale-specific multi-byte character set. On Windows this locale normally corresponds to the regional settings in the control panel.
What would be the best way (and if possible portable) to obtain an std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?
The C++11 standard introduced std::codecvt_utf8_utf16. Although it is deprecated as of C++17, according to this paper it will be available "until a suitable replacement is standardized".
To use this facet, call the static function:
boost::filesystem::path::imbue(
    std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );
After that, all calls to path::string() will convert from UTF-16 to UTF-8.
Another way is to use std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > to do the conversion only in some cases.
Complete example code:
#include <boost/filesystem.hpp>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void print_hex( std::string const& path );

int main()
{
    // Create UTF-16 path (on Windows) that contains the characters "ÄÖÜ".
    boost::filesystem::path path( L"\u00c4\u00d6\u00dc" );

    // Convert path using the default locale and print result.
    // On a system with German default locale, this prints "0xc4 0xd6 0xdc".
    // On a system with a different locale, this might fail.
    print_hex( path.string() );

    // Set locale for conversion from UTF-16 to UTF-8.
    boost::filesystem::path::imbue(
        std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );

    // Because we changed the locale, path::string() now converts the path to UTF-8.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    print_hex( path.string() );

    // Another option is to convert only case-by-case, by explicitly using a code converter.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > cvt;
    print_hex( cvt.to_bytes( path.wstring() ) );
}

// Print every byte of the string as a hexadecimal value.
void print_hex( std::string const& path )
{
    for( char c : path )
    {
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        std::cout << std::hex << "0x" << static_cast<unsigned>(static_cast<unsigned char>( c )) << ' ';
    }
    std::cout << '\n';
}
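If you would rather not depend on the deprecated facet, the UTF-16 to UTF-8 conversion itself is small enough to write by hand. The following is only a minimal sketch, not something Boost or the standard library provides: the names append_utf8 and utf16_to_utf8 are made up for this example, and throwing on an unpaired surrogate is just one possible policy (Windows file names are not guaranteed to be well-formed UTF-16, so you may prefer to substitute U+FFFD instead).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8( std::string& out, std::uint32_t cp )
{
    if( cp < 0x80 )
    {
        out += static_cast<char>( cp );
    }
    else if( cp < 0x800 )
    {
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    else if( cp < 0x10000 )
    {
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    else
    {
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
}

// Hypothetical helper: convert a sequence of UTF-16 code units to UTF-8.
// Throws std::invalid_argument if a surrogate is not part of a valid pair.
std::string utf16_to_utf8( std::u16string const& in )
{
    std::string out;
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        std::uint32_t cp = in[ i ];
        if( cp >= 0xD800 && cp <= 0xDBFF )
        {
            // High surrogate: it must be followed by a low surrogate.
            if( i + 1 >= in.size() || in[ i + 1 ] < 0xDC00 || in[ i + 1 ] > 0xDFFF )
                throw std::invalid_argument( "unpaired high surrogate" );
            // Combine the pair into a single code point above U+FFFF.
            cp = 0x10000 + ( ( cp - 0xD800 ) << 10 ) + ( in[ i + 1 ] - 0xDC00 );
            ++i;
        }
        else if( cp >= 0xDC00 && cp <= 0xDFFF )
        {
            throw std::invalid_argument( "unpaired low surrogate" );
        }
        append_utf8( out, cp );
    }
    return out;
}
```

On Windows, assuming wchar_t holds UTF-16 code units there, you could copy the result of path.wstring() into a std::u16string (for example with the iterator-range constructor) and pass it to utf16_to_utf8. The surrogate handling is the part that a UTF-8/UCS-4 facet does not do for you on a 2-byte wchar_t.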