Reputation: 64276

Reading file with cyrillic

I have to open a file that contains Cyrillic symbols. I've saved the file as UTF-8. Here is an example:

en: Couldn't your family afford a costume for you
ru: Не ваша семья позволить себе костюм для вас

This is how I open the file:

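std::ifstream readFile(fileData.c_str());
// Loop on getline itself: testing eof() before reading would process
// one extra, failed read at the end of the file.
while (std::getline(readFile, buffer))
{
  ...
}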

The first problem is that there are some extra symbols before the text 'en' (I saw this in the debugger):

"ï»¿en: least"

The other problem is the Cyrillic symbols:

" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"

What's wrong?

Upvotes: 0

Answers (4)

den bardadym

Reputation: 2815

I assume your OS is Windows. There are several simple approaches:

Use wchar_t, wstring, wifstream, etc.
Use the ICU library
Use some other library (there are many of them)

Note: to print to the console you must use WinAPI functions to convert UTF-8 to cp866 (my default Cyrillic Windows encoding is cp1251), because the Windows console supports only DOS encodings.

Note: to write to a file you need to know which encoding the file uses.

Upvotes: 1

bobince

Reputation: 536389

there is some symbol before text 'en'

That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.

Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) writes it nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is specifically listed.
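
If re-saving the file is not an option, the three BOM bytes (EF BB BF) can also be dropped in code after the first line is read. The following is only a minimal sketch, assuming the line is held in a std::string as in the question; stripUtf8Bom is a hypothetical helper name, not a standard function.

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 BOM (bytes EF BB BF)
// from the string if one is present. Returns true if a BOM was removed.
bool stripUtf8Bom(std::string& line)
{
    if (line.size() >= 3 &&
        static_cast<unsigned char>(line[0]) == 0xEF &&
        static_cast<unsigned char>(line[1]) == 0xBB &&
        static_cast<unsigned char>(line[2]) == 0xBF)
    {
        line.erase(0, 3);
        return true;
    }
    return false;
}
```

It would be called once, on the first line returned by std::getline, before the line is parsed.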

ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹

That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.

If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchars and write them from there. I don't know what the final destination for your strings is though... if you're just going to write them to another file or something you could just keep them as bytes and not care about what encoding they're in.
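
To illustrate what "convert the bytes to native-Unicode wchars" involves, here is a rough sketch of a UTF-8 to UTF-16 decoder. It assumes well-formed input and does no validation, and utf8ToUtf16 is a made-up name for this example. On Windows the WinAPI function MultiByteToWideChar with CP_UTF8 performs this conversion, and the result could then be passed to a wide-character output function such as WriteConsoleW.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
// Sketch only: assumes well-formed input and performs no validation.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;        // code point being assembled
        std::size_t extra;  // number of continuation bytes that follow
        if (c < 0x80)      { cp = c;        extra = 0; }
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
        else               { cp = c & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);

        if (cp >= 0x10000)
        {
            // Outside the Basic Multilingual Plane: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

For the string in the question, the two bytes D0 BD (shown as "Ð½" when misread as cp1252) decode to the single code unit U+043D, the letter н.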

Upvotes: 3

bmargulies

Reputation: 100050

Use icu to convert the text.

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798676

Use libiconv to convert the text to a usable encoding after reading.
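
A rough sketch of what that could look like with the POSIX-style iconv interface (iconv_open, iconv, iconv_close). The helper name convertEncoding and the buffer sizing are assumptions made for this example, the set of supported encoding names depends on the iconv implementation, and some platforms declare the input pointer as const char** or need -liconv at link time.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion is unsupported or fails.
std::string convertEncoding(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);  // note the order: target encoding first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string in = input;                       // iconv takes a non-const buffer
    std::string out(input.size() * 4 + 4, '\0');  // generous guess at the output size
    char* inPtr = &in[0];
    char* outPtr = &out[0];
    std::size_t inLeft = in.size();
    std::size_t outLeft = out.size();

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - outLeft);  // keep only the bytes actually written
    return out;
}
```

For example, convertEncoding(buffer, "UTF-8", "CP1251") would turn a UTF-8 line read from the file into Windows Cyrillic bytes, provided the local iconv knows both names.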

Upvotes: 0
