Mark
Mark

Reputation: 21

Reading in Russian characters (Unicode) using a basic_ifstream<wchar_t>

Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.

I've called file.imbue(loc) (and at this point, loc is correct, Russian_Russia.1251). And buf is of type basic_string<wchar_t>

The reason I'm using basic_ifstream<wchar_t> is because this is a template (so technically, basic_ifstream<T>, but in this case, T=wchar_t).

This all works perfectly with english characters...

while (file >> ch)
{
    if(isalnum(ch, loc))
    {
        buf += ch;
    }
    else if(!buf.empty())
    {
        // Do stuff with buf.
        buf.clear();
    }
}

I don't see why I'm getting garbage when reading Russian characters. (for example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc...

Upvotes: 2

Views: 2266

Answers (4)

VitalyVal
VitalyVal

Reputation: 1350

I am not sure, but you can try to call setlocale(LC_CTYPE, "");

Upvotes: 0

Hans Passant
Hans Passant

Reputation: 941505

There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t templated streams will default to the system code page, even though they are otherwise Unicode enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.

Upvotes: 1

Billy ONeal
Billy ONeal

Reputation: 106549

Iostreams, by default, assumes any data on disk is in a non-unicode format, for compatibility with existing programs that do not handle unicode. C++0x will fix this by allowing native unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t> used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.

If you want to use unicode with iostreams, you need to specify a codecvt facet with the form std::codecvt<wchar_t, wchar_t, mbstate_t>, which just passes through data unchanged.

Upvotes: 0

Jerry Coffin
Jerry Coffin

Reputation: 490138

Code page 1251 isn't for Unicode -- if memory serves, it's for 8859-5. Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since doing so would just involve passing the data through un-changed, but most still don't support it. For what it's worth, at least if I recall correctly, C++ 0x is supposed to add this.

Upvotes: 1

Related Questions