Kot Shrodingera

Reputation: 105

Reading a UTF-16 file in C++

I'm trying to read a file encoded as UTF-16LE with a BOM. I tried this code:

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main() {

  std::wifstream fin("/home/asutp/test");
  fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (!fin) {
    std::cout << "!fin" << std::endl;
    return 1;
  }
  if (fin.eof()) {
    std::cout << "fin.eof()" << std::endl;
    return 1;
  }
  std::wstring wstr;
  getline(fin, wstr);
  std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") != std::string::npos) {
    std::cout << "Found" << std::endl;
  } else {
    std::cout << "Not found" << std::endl;
  }

  return 0;
}

The file can contain Latin and Cyrillic characters. I created the file with the string "Test тест", and the code prints:

/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled

Not found

Process finished with exit code 0

I'm on Linux Mint 18.3 x64, CLion 2018.1.

Tried

Upvotes: 5

Views: 10577

Answers (2)

Barmak Shemirani

Reputation: 31599

Ideally you should save files in UTF-8, because Windows supports UTF-8 far better (aside from displaying Unicode in the console window) than POSIX systems support UTF-16. Even Microsoft products favor UTF-8 for saving files on Windows.

As an alternative, you can read the UTF-16 file into a buffer and convert it to UTF-8 with std::codecvt_utf8_utf16:

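//needs <fstream>, <string>, <locale>, <codecvt>
std::ifstream fin("utf16.txt", std::ios::binary);
fin.seekg(0, std::ios::end);
size_t size = (size_t)fin.tellg();

//skip BOM (assumes the file is at least 2 bytes long)
fin.seekg(2, std::ios::beg);
size -= 2;

//one char16_t per 2 bytes, no extra trailing '\0'
//reading bytes straight into char16_t assumes a little-endian host
std::u16string u16(size / 2, u'\0');
fin.read((char*)&u16[0], u16.size() * 2);

std::string utf8 = std::wstring_convert<
    std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);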

Or decode the bytes manually, which does not depend on the host's byte order:

std::ifstream fin("utf16.txt", std::ios::binary);

//skip BOM
fin.seekg(2);

//read as raw bytes
std::stringstream ss;
ss << fin.rdbuf();
std::string bytes = ss.str();

//make sure len is divisible by 2
size_t len = bytes.size();
if(len % 2) len--;

std::wstring sw;
for(size_t i = 0; i < len;)
{
    //little-endian
    int lo = bytes[i++] & 0xFF;
    int hi = bytes[i++] & 0xFF;
    sw.push_back(hi << 8 | lo);
}

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8 = convert.to_bytes(sw);
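
The manual loop above can be extended into a self-contained sketch that also handles surrogate pairs and avoids std::wstring_convert, which is deprecated since C++17. The helper names append_utf8 and utf16le_to_utf8 are hypothetical, and the handling of malformed input (unpaired surrogates become U+FFFD, a trailing odd byte is dropped) is an assumption rather than anything the question requires:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: inside the Unicode range
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: convert raw UTF-16LE bytes to a UTF-8 string.
// A leading FF FE byte-order mark is skipped if present.
std::string utf16le_to_utf8(const std::string& bytes) {
    // Read the 16-bit little-endian code unit that starts at byte `pos`.
    auto unit = [&bytes](std::size_t pos) -> std::uint32_t {
        const std::uint32_t lo = static_cast<unsigned char>(bytes[pos]);
        const std::uint32_t hi = static_cast<unsigned char>(bytes[pos + 1]);
        return (hi << 8) | lo;
    };

    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF) {
        i = 2;  // skip the BOM
    }

    std::string out;
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            const std::uint32_t low = unit(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        // Assumption: replace any unpaired surrogate with U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With the file contents read into a std::string as in the second snippet (ss.str()), utf16le_to_utf8(ss.str()).find("Test") can then be compared against std::string::npos.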

Upvotes: 8

SChepurin

Reputation: 1854

Replace std::string::npos with std::wstring::npos, and your code should work:

...
  //std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") == std::wstring::npos) {
    std::cout << "Not Found" << std::endl;
  } else {
    std::cout << "found" << std::endl;
  } 

Upvotes: 0
