Reputation: 16724
I'm using the std.net.curl module to get HTML from a remote web page, but I have some encoding problems that I have no idea how to fix. For some pages, like facebook.com, I get the following error message at run time:
std.net.curl.CurlException@/usr/include/d/dmd/phobos/std/net/curl.d(800): Invalid encoding sequence for enconding 'utf-8'
----------------
./foo(char[] std.net.curl._decodeContent!(char)._decodeContent(ubyte[], immutable(char)[])+0xf6) [0x812e6ba]
./foo(char[] std.net.curl._basicHTTP!(char)._basicHTTP(const(char)[], const(void)[], std.net.curl.HTTP)+0x28e) [0x80f89f6]
./foo(char[] std.net.curl.get!(std.net.curl.HTTP, char).get(const(char)[], std.net.curl.HTTP)+0x8f) [0x80f8737]
./foo(immutable(char)[] teste.get_html(immutable(char)[])+0x112) [0x80f0806]
./foo(_Dmain+0x5f) [0x80f06e3]
./foo(extern (C) int rt.dmain2.main(int, char**).void runMain()+0x14) [0x8138340]
./foo(extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate())+0x18) [0x8137e50]
./foo(extern (C) int rt.dmain2.main(int, char**).void runAll()+0x32) [0x8138382]
./foo(extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate())+0x18) [0x8137e50]
./foo(main+0x94) [0x8137e04]
/lib/libc.so.6(__libc_start_main+0xf3) [0xb7593003]
For google.com, I get this (am I getting binary data? How?):
�S��7�砱�y�����g�d��C���|��W��O�s��~����*6��@�4�&�A�J����r▒4=�FT�e�� [...]
For dlang.org, it works fine.
The question is: what's the correct way to read a page, independent of its encoding?
Here's my D code:
import std.net.curl;

string get_html(string page) {
    auto client = HTTP();
    client.clearRequestHeaders();
    // Mimic a Firefox request (DNT = Do Not Track).
    client.addRequestHeader("DNT", "1");
    client.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    client.addRequestHeader("Accept-Encoding", "gzip, deflate");
    client.addRequestHeader("User-Agent", "Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1");
    client.addRequestHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    return cast(string)get(page, client);
}
Thanks in advance.
Upvotes: 3
Views: 783
Reputation: 6055
cast(string)get(page, client) fails for any non-UTF-8 sequence.
Use the standalone get and post functions; they decode the content according to the headers the server sends and return valid UTF-8.
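For example, a minimal sketch (fetch_html is my naming, not part of std.net.curl): drop the hand-rolled Accept-Encoding header so the library negotiates the content encoding itself, and use .idup instead of a cast to get an immutable string:

import std.net.curl;

// Minimal sketch: with no forced Accept-Encoding header, get() returns
// char[] already decoded to valid UTF-8 per the server's response headers.
string fetch_html(string url)
{
    return get(url).idup; // copy the char[] into an immutable string
}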
Upvotes: 4
Reputation: 25187
Does curl really support gzip and deflate encodings, and the ISO-8859-1 charset? Should you really be specifying those headers yourself, as opposed to letting curl itself declare the encodings and charsets it supports?
To answer your question:
What's the correct way to read a page, independent of its encoding?
You look at the headers the server sends you, which specify the page's content encoding and charset, then you interpret the data according to those headers (e.g. by calling zlib to ungzip or inflate the data, then translating the decompressed HTML to UTF-8).
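A rough sketch of that manual path, covering only the decompression step (fetch_decoded is a hypothetical helper, and the lower-cased responseHeaders keys are an assumption about std.net.curl's behavior; charset translation, e.g. via std.encoding, would follow the same pattern):

import std.algorithm.searching : canFind;
import std.net.curl;
import std.zlib : HeaderFormat, UnCompress;

// Sketch only: fetch raw bytes, then undo the Content-Encoding by hand.
ubyte[] fetch_decoded(string url)
{
    auto client = HTTP();
    client.addRequestHeader("Accept-Encoding", "gzip");

    // Ask for ubyte[] so std.net.curl skips its UTF-8 validation.
    auto raw = get!(HTTP, ubyte)(url, client);

    // The server's reply headers; keys are assumed lower-cased here.
    auto enc = "content-encoding" in client.responseHeaders;
    if (enc !is null && canFind(*enc, "gzip"))
    {
        auto un = new UnCompress(HeaderFormat.gzip); // gzip framing
        auto data = cast(ubyte[]) un.uncompress(raw);
        data ~= cast(ubyte[]) un.flush();
        return data;
    }
    return raw; // not compressed, or an encoding we did not request
}

The returned bytes still need a charset conversion (driven by the Content-Type header) before they can safely be treated as a UTF-8 string.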
Upvotes: 1