mike

Reputation: 880

C++ ifstream and "umlauts"

I am having an issue with "umlauts" (letters ä, ü, ö, ...) and ifstream in C++.

I use curl to download an HTML page and ifstream to read the downloaded file line by line and parse some data out of it. This goes well until I hit a line like one of the following:

te="Olimpija Laibach - Tromsö";
te="Burghausen - Münster";

My code parses these lines and outputs them as follows:

Olimpija Laibach vs. Troms?
Burghausen vs. M?nster

Things like outputting umlauts directly from the code work:

cout << "öäü" << endl; // This works fine

My code looks somewhat like this:

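#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
    ifstream fin("file");
    string line;

    // Loop on getline itself; testing eof() first would process a failed read.
    while (getline(fin, line)) {
        // find() returns string::npos (never a negative int) when nothing matches.
        if (line.find("te=") != string::npos) {
            string::size_type pos = line.find(" - ");
            if (pos == string::npos) continue;  // no separator on this line
            string team1 = line.substr(4, pos - 4);                       // skip te="
            string team2 = line.substr(pos + 3, line.length() - pos - 6); // drop ";
            cout << team1 << " vs. " << team2 << endl;
        }
    }
}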

Edit: The weird thing is that the same code (the only changed things are the source and the delimiters) works for another text input file (same procedure: download with curl, read with ifstream). Parsing and outputting a line like the following is no problem:

<span id="...">Fernwärme Vienna</span>

Upvotes: 3

Views: 2831

Answers (1)

James Kanze

Reputation: 153977

What locale is imbued in fin? In the code you show, it would be the global locale, which, if you haven't reset it, is "C".

If you're anywhere outside the Anglo-Saxon world (and the strings you show suggest that you are), one of the first things you do in main should be:

std::locale::global( std::locale( "" ) );

This sets the global locale (and thus the default locale for any streams opened later) to the locale being used in the surrounding environment. (Formally, to an implementation-defined native environment, but in practice, to whatever the user is using.) In the "C" locale, the encoding is almost always ASCII; ASCII doesn't recognize umlauts, and according to the standard, illegal encodings in input should be replaced with an implementation-defined character (IIRC; it's been some time since I've actually reread this section). In output, of course, you're not supposed to have any unknown characters, so the implementation doesn't check for them, and they go through.

Since std::cin, etc. are opened before you have a chance to set the global locale, you'll have to imbue them with std::locale( "" ) specifically.

If this doesn't work, you might have to find some specific locale to use.

Upvotes: 2
