Max Frai

Reputation: 64276

Reading a file with Cyrillic characters

I have to open a file containing Cyrillic characters. I've saved the file as UTF-8. Here is an example:

en: Couldn't your family afford a costume for you
  ru: Не ваша семья позволить себе костюм для вас

Here is how I open the file:

std::ifstream readFile(fileData.c_str());
// Test getline itself: looping on !eof() runs the body once more after the last read fails
while (std::getline(readFile, buffer))
{
  ...
}

The first problem: there is some symbol before text 'en' (I saw this in the debugger):

"ï»¿en: least"

The other problem is the Cyrillic symbols:

" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"

What's wrong?

Upvotes: 0

Views: 2952

Answers (4)

den bardadym

Reputation: 2815

I suppose that your OS is Windows. There are several simple ways:

  1. Use wchar_t, wstring, wifstream, etc.
  2. Use icu library
  3. Use some other library (there are really many of them)

Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (the default Cyrillic Windows encoding on my system is cp1251), because the Windows console supports only DOS encodings.

Note: for writing to a file you need to know which encoding your file uses.

Upvotes: 1

bobince

Reputation: 536389

there is some symbol before text 'en'

That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.

Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) uses it nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is specifically listed.
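If re-saving the file isn't an option, the faux-BOM can also be dropped while reading. This is only a minimal sketch: stripUtf8Bom and readLines are hypothetical helper names, not something from the question, and it assumes the BOM can only appear as the first three bytes (EF BB BF) of the first line.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF), if present.
inline void stripUtf8Bom(std::string& line)
{
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0)
        line.erase(0, 3);
}

// Hypothetical helper: read all lines, stripping the BOM from the first one only.
inline std::vector<std::string> readLines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string buffer;
    bool first = true;
    while (std::getline(in, buffer))
    {
        if (first)
        {
            stripUtf8Bom(buffer);
            first = false;
        }
        lines.push_back(buffer);
    }
    return lines;
}
```

With something like this, the first line would start with "en" directly, and every other line is passed through as untouched bytes.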

ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹

That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
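To make the mechanism concrete, here is a rough sketch of what the misreading does: each UTF-8 byte is treated as one character of its own. The function name misreadAsLatin1 is made up for illustration, and it uses Latin-1 as a simplifying assumption; cp1252 maps bytes 0xA0-0xFF the same way but differs in the 0x80-0x9F range.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration: treat every byte of a UTF-8 string as a separate
// Latin-1 character and re-encode the result as UTF-8.
// Assumption: Latin-1 stands in for cp1252 (they differ for bytes 0x80-0x9F).
inline std::string misreadAsLatin1(const std::string& utf8Bytes)
{
    std::string out;
    for (unsigned char b : utf8Bytes)
    {
        if (b < 0x80)
        {
            out.push_back(static_cast<char>(b));               // ASCII is unchanged
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F))); // continuation byte
        }
    }
    return out;
}
```

The two bytes D0 BD that encode the single letter н come out as the two characters Ð and ½, which is the kind of doubled-up text seen in the debugger.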

If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_t strings and write them from there. I don't know what the final destination for your strings is though... if you're just going to write them to another file or something you could just keep them as bytes and not care about what encoding they're in.
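On Windows the conversion would normally go through the WinAPI (MultiByteToWideChar with CP_UTF8, then WriteConsoleW for output). As a portable illustration of just the decoding step, here is a minimal sketch. decodeUtf8 is a hypothetical helper, it produces UTF-32 code points rather than Windows' UTF-16 wchar_t, and it does not reject overlong or surrogate encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences are replaced by U+FFFD, one byte at a time.
inline std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }

        bool ok = (i + extra < bytes.size());
        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }

        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Cyrillic letters all sit below U+10000, so each decoded code point would fit in a single 16-bit wchar_t on Windows; anything above that range would need a surrogate pair, which this sketch does not produce.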

Upvotes: 3

bmargulies

Reputation: 100050

Use icu to convert the text.

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798676

Use libiconv to convert the text to a usable encoding after reading.

Upvotes: 0
