Reputation: 33
I am trying to convert cp1251 text to utf-8. What I am doing here is creating the buffer of hex numbers of given symbols in cp1251 to later convert those hex symbols to utf-8. The problem is that sometimes the converted string has some trash symbols in the end.
The output of converting the same string many times (203 ñòåï ÒÖÍÐ.466219.007 Èíòåðàêòèâíûé êîìïëåêñ NextPanel 43/NAUO1):
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1омплекс NextPanelВ 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO14V
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1sгцf¤№
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1п}тУї0bЊ.Z¶ї¬ЁЌ€/ГїаА Om›Ґї
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1sгцf¤№
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1™™™™Щ?
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
theNamePrefix = 203 степ ТЦНР.466219.007 Интерактивный комплекс NextPanel 43/NAUO1
char *ConvertWindows1251ToUtf8(string stringToConvert)
{
// string tmpString = stringToConvert.ToCString();
const char *tmpCharArray = stringToConvert.c_str();
vector<char> charBuffer;
char *buffer = new char[char_traits<char>::length(tmpCharArray)];
int latter = 0;
for (int i = 0; i < std::char_traits<char>::length(tmpCharArray); i++)
{
string tmpHexLatter;
int hexLatter = 0xFF & tmpCharArray[i];
stringstream ss;
ss << hex << hexLatter;
tmpHexLatter = ss.str();
if (hexLatter != 0xc3)
{
if (hexLatter != 0x30 && hexLatter != 0x31 && hexLatter != 0x32 && hexLatter != 0x33 && hexLatter != 0x34 && hexLatter != 0x35 && hexLatter != 0x36 && hexLatter != 0x37 && hexLatter != 0x38 && hexLatter != 0x39 && hexLatter != 0x61 && hexLatter != 0x62 && hexLatter != 0x63 && hexLatter != 0x64 && hexLatter != 0x65 && hexLatter != 0x66 && hexLatter != 0x67 && hexLatter != 0x68 && hexLatter != 0x69 && hexLatter != 0x6A && hexLatter != 0x6B && hexLatter != 0x6C && hexLatter != 0x6D && hexLatter != 0x6E && hexLatter != 0x6F && hexLatter != 0x70 && hexLatter != 0x71 && hexLatter != 0x72 && hexLatter != 0x73 && hexLatter != 0x74 && hexLatter != 0x75 && hexLatter != 0x76 && hexLatter != 0x77 && hexLatter != 0x78 && hexLatter != 0x79 && hexLatter != 0x7A && hexLatter != 0x41 && hexLatter != 0x42 && hexLatter != 0x43 && hexLatter != 0x44 && hexLatter != 0x45 && hexLatter != 0x46 && hexLatter != 0x47 && hexLatter != 0x48 && hexLatter != 0x49 && hexLatter != 0x4A && hexLatter != 0x4B && hexLatter != 0x4C && hexLatter != 0x4D && hexLatter != 0x4E && hexLatter != 0x4F && hexLatter != 0x50 && hexLatter != 0x51 && hexLatter != 0x52 && hexLatter != 0x53 && hexLatter != 0x54 && hexLatter != 0x55 && hexLatter != 0x56 && hexLatter != 0x57 && hexLatter != 0x58 && hexLatter != 0x59 && hexLatter != 0x5A && hexLatter != 0x2E)
hexLatter += 64;
if (hexLatter == 0x60)
hexLatter = 0xA0;
if (hexLatter == 0x6F)
hexLatter = 0x2F;
stringstream ss;
ss << hex << hexLatter;
string tmpHex = ss.str();
tmpHexLatter = "0x" + tmpHex;
latter = stoi(tmpHexLatter, {}, 16);
charBuffer.push_back((char)latter);
}
}
for (int i = 0; i < charBuffer.size(); i++)
{
buffer[i] = charBuffer[i];
}
return g_convert(buffer, -1, "utf-8", "Windows-1251", NULL, NULL, NULL);
/*string tmpStr = stringToConvert.ToCString();
std::unique_ptr<gchar, void (*)(gpointer)> p(g_convert(tmpStr.c_str(), -1, "utf-8", "Windows-1251", NULL, NULL, NULL), g_free);
return TCollection_AsciiString(p.get());*/
}
Upvotes: 0
Views: 1630
Reputation: 52344
If you're using linux, you have the iconv(3)
API available to convert between character encodings. Unfortunately, it's a C interface so it can be a bit ugly to use from C++, but it's still better than whatever you're trying to do with converting codepoints to hex strings.
I dug out an old wrapper class I wrote once to make iconv easier to use from C++ (C++17 or newer):
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>
#include <string_view>
#include <iconv.h>
// Wrap iconv API in a RAII-styled class for ease of use
class iconv_wrapper {
private:
iconv_t cd;
public:
iconv_wrapper(std::string_view from, std::string_view to = "UTF-8")
: cd{iconv_open(to.data(), from.data())} {
if (cd == reinterpret_cast<iconv_t>(-1)) {
// Need better error handling
std::cerr << "Unable to open converter from " << from << " to " << to
<< ": " << std::strerror(errno) << '\n';
std::exit(EXIT_FAILURE);
}
}
~iconv_wrapper() noexcept { iconv_close(cd); }
std::string convert(std::string_view);
};
std::string iconv_wrapper::convert(std::string_view input) {
// Work out the maximum output size (Assuming converting from a
// single-byte encoding to UTF-8) and allocate a buffer and the
// other args needed for iconv
std::size_t insize = input.size();
std::size_t outsize = insize * 4;
std::size_t orig_outsize = outsize;
auto outbuf = std::make_unique<char[]>(outsize);
char *indata = const_cast<char *>(&input[0]);
char *outdata = &outbuf[0];
// Convert the input argument
std::size_t ret = iconv(cd, &indata, &insize, &outdata, &outsize);
if (ret == static_cast<std::size_t>(-1)) {
// Need better error handling
std::cerr << "Couldn't convert input data: " << std::strerror(errno)
<< '\n';
std::exit(EXIT_FAILURE);
}
// And return it
return std::string(outbuf.get(), orig_outsize - outsize);
}
int main(int argc, char **argv) {
if (argc != 2) {
std::cerr << "Usage: " << argv[0] << " cp1251-encoded-file\n";
return EXIT_FAILURE;
}
std::ifstream in{argv[1], std::ios_base::in | std::ios_base::binary};
if (!in) {
std::cerr << "Unable to open " << argv[1] << " for reading!\n";
return EXIT_FAILURE;
}
iconv_wrapper conv{"CP1251"};
std::string input{std::istreambuf_iterator<char>{in},
std::istreambuf_iterator<char>{}};
std::cout << conv.convert(input);
return 0;
}
Upvotes: 2
Reputation: 595827
You don't need buffer
at all, you can pass charBuffer.data()
or even stringToConvert.c_str()
to g_convert()
.
But, more importantly, bothbuffer
and charBuffer
are not null-terminated, and you are not otherwise passing the final length to g_convert()
, so g_convert()
will end up either reaching out of bounds, or try to convert uninitialized data, either way leading to undefined behavior, which is why you see garbage on the end of the result.
On a side note, you don't need the "0x"
prefix when calling std::stoi()
with base=16
.
Also, why are you returning a char*
instead of a std::string
? Who is responsible for allocating and freeing the memory? You really should let std::string
handle that.
Upvotes: 2