uoakinci

Reputation: 47

Why do chars become useless? libcurl, C++, UTF-8 encoded HTML

First of all, sorry for my bad English. I have done my research, but I couldn't find any answers that solve my problem. I have read up on code pages, UTF-8 and related topics in C and C++, and I know that a std::string can hold UTF-8. My development machine runs English Windows XP with the console code page set to 1254 (Windows Turkish), and I can use Turkish extended characters (İığşçüö) in a std::string, count them, and send them through the mysqlpp API to write to databases without any problem. But when I use curl to fetch some HTML and write it to a std::string, my problem starts.

#include <iostream>
#include <windows.h>
#include <wincon.h>
#include <curl.h>
#include <string>
int main()
{
   SetConsoleCP(1254);
   SetConsoleOutputCP(1254);
   std::string s;
   std::cin>>s;
   std::cout<<s<<std::endl;
   return 0;
}

When I run this and type ğşçöüİı, the output is the same: ğşçöüİı.

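#include <iostream>
#include <windows.h>
#include <wincon.h>
#include <curl.h>
#include <string>

// libcurl write callback: appends each received chunk to the std::string
// passed via CURLOPT_WRITEDATA and returns the number of bytes handled.
size_t writer(char *data, size_t size, size_t nmemb, std::string *buffer)
{
   size_t res=0;
   if(buffer!=NULL)
   {
      buffer->append(data,size*nmemb);
      res=size*nmemb;
   }
   return res;
}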
int main()
{
   SetConsoleOutputCP(1254);
   std::string html;
   CURL *curl;
   CURLcode result;
   curl=curl_easy_init();
   if(curl)
   {
      curl_easy_setopt(curl, CURLOPT_URL, "http://site.com");
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
      result=curl_easy_perform(curl);
      if(result==CURLE_OK)
      {
         std::cout<<html<<std::endl;
      }
      curl_easy_cleanup(curl);
   }
   return 0;
}

When I compile and run:

if the HTML contains 'ı', cmd prints 'ı'; 'ö' prints as 'Ķ', 'ğ' prints as 'ÄŸ', 'İ' prints as 'Ä˚', etc.

If I change the code page to 65000,

...
SetConsoleOutputCP(65000);//For utf8
...

The result is the same, so the cause of the problem isn't the cmd code page.

The HTTP response headers indicate that the charset is set to utf-8, and the HTML metadata says the same.

As I understand it, the source of the problem is the function "writer" or curl itself. The incoming data is split into chars, so extended characters like ı, İ, ğ are split into 2 chars each and stored that way in the std::string's char array. As a result, the code page equivalents of these half characters are printed or used anywhere else in the code (such as when mysqlpp writes that string to the db).
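The two-chars-per-letter observation can be checked by dumping the raw bytes held in the std::string. Below is a minimal sketch; to_hex is a hypothetical helper name, not part of libcurl or the original code. For example, 'ğ' (U+011F) is encoded in UTF-8 as the two bytes C4 9F, and code page 1254 displays byte C4 as 'Ä' and byte 9F as 'Ÿ', which matches the 'ÄŸ' seen in the console.

```cpp
#include <string>

// Hypothetical helper (not from the original post): returns the bytes of s
// as space-separated uppercase hex, e.g. "C4 9F".
std::string to_hex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];      // high nibble
        out += digits[b & 0x0F];    // low nibble
    }
    return out;
}
```

Calling to_hex(html) on the fetched page and looking at the bytes around a Turkish letter would show whether the string holds intact UTF-8 sequences, in which case the data is fine and only the interpretation at output time differs.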

I don't know how to solve this or what to do in the writer function or anywhere else. Am I thinking right? If so, what can I do about this problem? Or is the source of the problem elsewhere?

I'm using mingw32 on Windows XP 32-bit with the Code::Blocks IDE.

Upvotes: 5

Views: 1820

Answers (2)

Mihai Nita

Reputation: 5787

The returned string is UTF-8, so you should set the console code page to 65001 (as recommended by sth). Or convert the string to code page 1254 and use the 1254 code page for console output, as you did before.
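The second option, converting the UTF-8 string to 1254, might look roughly like the sketch below. It is a portable illustration only, and the names cp1254_byte and utf8_to_cp1254 are hypothetical. It handles ASCII plus two-byte UTF-8 sequences, which covers the Turkish letters in the question, and replaces everything else with '?'. On Windows the more usual route would be the Win32 functions MultiByteToWideChar (with CP_UTF8) followed by WideCharToMultiByte (with code page 1254).

```cpp
#include <string>

// Hypothetical helper: maps a code point from a two-byte UTF-8 sequence
// to its Windows-1254 byte, or '?' if this sketch does not handle it.
static unsigned char cp1254_byte(unsigned long cp)
{
    switch (cp) {
        case 0x011E: return 0xD0; // Ğ
        case 0x011F: return 0xF0; // ğ
        case 0x0130: return 0xDD; // İ
        case 0x0131: return 0xFD; // ı
        case 0x015E: return 0xDE; // Ş
        case 0x015F: return 0xFE; // ş
        // These Latin-1 letters gave up their slots to the Turkish ones above.
        case 0xD0: case 0xDD: case 0xDE:
        case 0xF0: case 0xFD: case 0xFE: return '?';
    }
    // The rest of U+00A0..U+00FF keeps its Latin-1 position in 1254.
    if (cp >= 0xA0 && cp <= 0xFF) return static_cast<unsigned char>(cp);
    return '?';
}

// Hypothetical helper: converts UTF-8 text to Windows-1254 bytes.
std::string utf8_to_cp1254(const std::string& in)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // Plain ASCII is identical in both encodings.
            out += static_cast<char>(b0);
            ++i;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequence: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy.
            unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            unsigned long cp = ((b0 & 0x1FUL) << 6) | (b1 & 0x3FUL);
            out += static_cast<char>(cp1254_byte(cp));
            i += 2;
        } else {
            // Longer or malformed sequence: emit '?' and skip its continuation bytes.
            out += '?';
            ++i;
            while (i < in.size() &&
                   (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) ++i;
        }
    }
    return out;
}
```

With the console left at code page 1254, printing utf8_to_cp1254(html) instead of html would then be expected to show the Turkish letters as typed.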

Upvotes: 0

sth

Reputation: 229864

The correct codepage for UTF-8 is 65001, not 65000.

Also, have you checked if setting the codepage succeeds? The SetConsoleOutputCP function indicates success or failure by its return value.
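A minimal sketch of that check follows, assuming a wrapper named use_utf8_console_output (a hypothetical name). SetConsoleOutputCP returns nonzero on success, and GetLastError gives the error code on failure. The #ifdef is only there so the sketch also compiles off Windows, where it does nothing.

```cpp
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: tries to switch console output to UTF-8 (65001)
// and reports whether that worked.
bool use_utf8_console_output()
{
#ifdef _WIN32
    if (SetConsoleOutputCP(65001) == 0) {
        // A zero return value means the call failed.
        std::cerr << "SetConsoleOutputCP failed, error "
                  << GetLastError() << std::endl;
        return false;
    }
    // Read the setting back to confirm it took effect.
    return GetConsoleOutputCP() == 65001;
#else
    // Nothing to set on other platforms in this sketch.
    return true;
#endif
}
```

Calling this at the start of main and printing the HTML only if it returns true would distinguish a failed code page switch from a real encoding problem in the data.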

Upvotes: 1
