Reputation: 571

C++ Character Encoding

This is my C++ Code where i'm trying to encode the received file path to utf-8.

#include <string>
#include <iostream>

using namespace std;
void latin1_to_utf8(unsigned char *in, unsigned char *out);
string encodeToUTF8(string _strToEncode);

int main(int argc,char* argv[])
{

// Code to receive fileName from Sockets
cout << "recvd ::: " << recvdFName << "\n";
string encStr = encodeToUTF8(recvdFName);
cout << "encoded :::" << encStr << "\n";
}

void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
 while (*in)
 {
  if (*in<128)
  {
    *out++=*in++;
  }
  else
  {
    *out++=0xc2+(*in>0xbf);
    *out++=(*in++&0x3f)+0x80;
  }
 }
 *out = '\0';
}

string encodeToUTF8(string _strToEncode)
{
  int len= _strToEncode.length();
  unsigned char* inpChar = new unsigned char[len+1];
  unsigned char* outChar = new unsigned char[2*(len+1)];
  memset(inpChar,'\0',len+1);
  memset(outChar,'\0',2*(len+1));
  memcpy(inpChar,_strToEncode.c_str(),len);
  latin1_to_utf8(inpChar,outChar);
  string _toRet = (const char*)(outChar);
  delete[] inpChar;
  delete[] outChar;
  return _toRet;
 }

And the OutPut is

recvd ::: /Users/zeus/ÄÈÊÑ.txt  
encoded ::: /Users/zeus/AÌEÌEÌNÌ.txt

The above function latin1_to_utf8 is provided as an solution Convert ISO-8859-1 strings to UTF-8 in C/C++ , Looks like it works.[Answer is accepted]. So i think i must be making some mistake, but i'm not able to identify what it is. Can someone help me out with this , Please.

I have first posted this question in Codereview,but i'm not getting any answers out there. So sorry for the duplication.

Upvotes: 0

Answers (2)

Teodor Boyadzhiev

Reputation: 39

Do you use any platform or you build it on the top of std? I am sure that many people use such convertions and therefore there is library. I strongly recommend you to use the libraray, because the library is tested and usually the best know way is used.

A library which I found doing this is boost locale

This is standard. If you use QT I will recommend you to use the QT conversion library for this (it is platform independant)

In case you want to do it yourself (you want to see how it works or for any other reason) 1. Make sure that you allocate memory ! - this is very important in C,C++ . Since you use iostream use new to allocate memory and delete to release it (this is also important C++ won't figure out when to release it for sure. This is developer's job here - C++ is hardcore :D ) 2. Check that you allocate the right size of memory. I expect unicode to be larger memory (it encodes more symbols and sometimes uses large numbers). 3. As already mentioned above read from somewhere (terminal or file) but output in new file. After that when you open the file with text editor make sure you set the encoding to be utf-8 ( your text editor has to know how to interpretate the data)

I hope that helps.

Upvotes: 1

Ulrich Eckhardt

Reputation: 17444

You are first outputting the original Latin-1 string to a terminal expecting a certain encoding, probably Latin-1. You then transcode to UTF-8 and output it to the same terminal, which interprets it differently. Classic mojibake. Try the following with the output instead:

for(size_t i=0, len=strlen(outChar); i!=len; ++i)
    std::cout << static_cast<unsigned>(static_cast<unsigned char>(outChar[i])) << ' ';

Note that the two casts are to first get the unsigned byte value and then to get the unsigned value to keep the stream from treating it as a char. Note that your char might already be unsigned, but that's compile-dependent.

Upvotes: 0

C++ Character Encoding

Answers (2)

Related Questions