Jack

Reputation: 16724

Invalid encoding sequence for enconding 'utf-8'

I'm using the std.net.curl module to get HTML from a remote web page, but I have some encoding problems that I have no idea how to fix. For some pages, like facebook.com, I get the following error message (at run time):

std.net.curl.CurlException@/usr/include/d/dmd/phobos/std/net/curl.d(800): Invalid encoding sequence for enconding 'utf-8'
----------------
./foo(char[] std.net.curl._decodeContent!(char)._decodeContent(ubyte[], immutable(char)[])+0xf6) [0x812e6ba]
./foo(char[] std.net.curl._basicHTTP!(char)._basicHTTP(const(char)[], const(void)[], std.net.curl.HTTP)+0x28e) [0x80f89f6]
./foo(char[] std.net.curl.get!(std.net.curl.HTTP, char).get(const(char)[], std.net.curl.HTTP)+0x8f) [0x80f8737]
./foo(immutable(char)[] teste.get_html(immutable(char)[])+0x112) [0x80f0806]
./foo(_Dmain+0x5f) [0x80f06e3]
./foo(extern (C) int rt.dmain2.main(int, char**).void runMain()+0x14) [0x8138340]
./foo(extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate())+0x18) [0x8137e50]
./foo(extern (C) int rt.dmain2.main(int, char**).void runAll()+0x32) [0x8138382]
./foo(extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate())+0x18) [0x8137e50]
./foo(main+0x94) [0x8137e04]
/lib/libc.so.6(__libc_start_main+0xf3) [0xb7593003]

For google.com, I get this (am I getting binary data? How?):

�S��7�砱�y�����g�d��C���|��W��O�s��~����*6��@�4�&�A�J����r▒4=�FT�e�� [...]

For dlang.org it works fine.

The question is: What's the correct way to read it? independent of page encoding.

Here's my D code:

string get_html(string page) {
  auto client = HTTP(); 
  client.clearRequestHeaders();
  client.addRequestHeader("DNT", "1");
  client.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
  client.addRequestHeader("Accept-Encoding", "gzip, deflate");
  client.addRequestHeader("User-Agent", "Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");
  client.addRequestHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

  return cast(string)get(page, client);
}

Thanks in advance.

Upvotes: 3

Views: 783

Answers (2)

dav1d

Reputation: 6055

cast(string)get(page, client) fails for any non-UTF-8 sequence.

Use the standalone get and post functions; they decode the content according to the headers sent by the server and return valid UTF-8.

Upvotes: 4

Vladimir Panteleev

Reputation: 25187

Does curl really support gzip and deflate encodings, and the ISO-8859-1 charset? Should you really be specifying those headers yourself, as opposed to letting curl itself declare the encodings and charsets it supports?
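One way to let curl declare what it supports is to set libcurl's accept-encoding option instead of adding the Accept-Encoding header by hand. This is only a sketch: getHtml is a hypothetical rewrite of the question's get_html, and it assumes the installed libcurl was built with zlib support. With an empty string, libcurl advertises every content encoding it was built with and decompresses the response itself, so get receives plain text to decode.

```d
import std.net.curl : CurlOption, HTTP, get;

/// Hypothetical variant of the question's get_html: no hand-written
/// Accept-Encoding or Accept-Charset headers.
string getHtml(string page)
{
    auto client = HTTP();
    client.addRequestHeader("User-Agent",
        "Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");

    // Empty string: libcurl sends an Accept-Encoding header listing the
    // encodings it supports and transparently decompresses the body.
    client.handle.set(CurlOption.encoding, "");

    // get() returns char[]; idup makes an immutable copy instead of casting.
    return get(page, client).idup;
}

void main()
{
    import std.stdio : writeln;

    // Needs network access; prints the length of the decoded page.
    writeln(getHtml("dlang.org").length);
}
```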

To answer your question:

The question is: What's the correct way to read it? independent of page encoding.

Look at the headers the server sends you, which state the content encoding and charset, then interpret the data according to those headers (e.g. call zlib to gunzip or inflate the data, then transcode the decompressed HTML to UTF-8).
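The manual route described above can be sketched roughly as follows. decodeBody is a hypothetical helper, not part of Phobos; it assumes you already have the raw body bytes plus the values of the Content-Encoding and Content-Type headers, and it only handles ISO-8859-1 and UTF-8. It also assumes the compressed data carries a gzip or zlib header, which some servers omit for "deflate".

```d
import std.algorithm.searching : canFind;
import std.encoding : Latin1String, transcode;
import std.uni : toLower;
import std.utf : validate;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical helper: decompress and transcode a raw HTTP body to UTF-8.
string decodeBody(const(ubyte)[] raw, string contentEncoding, string contentType)
{
    const(ubyte)[] data = raw;

    // Step 1: undo the content encoding (gzip or zlib-wrapped deflate).
    auto enc = contentEncoding.toLower;
    if (enc.canFind("gzip") || enc.canFind("deflate"))
    {
        auto z = new UnCompress(HeaderFormat.determineFromData);
        auto plain = cast(const(ubyte)[]) z.uncompress(raw);
        plain ~= cast(const(ubyte)[]) z.flush();
        data = plain;
    }

    // Step 2: convert from the declared charset to UTF-8.
    if (contentType.toLower.canFind("iso-8859-1"))
    {
        string utf8;
        transcode(cast(Latin1String) data.idup, utf8);
        return utf8;
    }

    // Otherwise assume UTF-8 and check it; validate throws on bad input.
    auto text = cast(string) data.idup;
    validate(text);
    return text;
}

void main()
{
    import std.zlib : compress;

    // Round trip with locally compressed data, so no network is needed.
    auto packed = compress("<html>hi</html>");
    assert(decodeBody(packed, "deflate", "text/html; charset=utf-8") == "<html>hi</html>");
}
```

In a real program the raw bytes and header values would come from the HTTP object, for instance through its onReceive and onReceiveHeader callbacks.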

Upvotes: 1
