Reputation: 1616
I'm writing a browser for the iPhone.
I'm using
ASIHTTPRequest *request = [ASIHTTPRequest requestWithURL:url];
[request startSynchronous];
// responseString decodes the body using the request's responseEncoding
NSString *storyHTML = [request responseString];
to download HTML. The problem is that sometimes there is no encoding in the HTTP header, in which case the above code defaults to ISO Latin-1 (NSISOLatin1StringEncoding).
In this case I can read up to the <head> of the HTML and find the meta tag that specifies the actual encoding, which looks something like this:
<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />
The problem is that there are a TON of possible encodings that can be found in the meta tag, as seen here: http://www.iana.org/assignments/character-sets
I would need to somehow convert one of those encoding strings into one of the constant encodings found in the NSString class:
enum {
NSASCIIStringEncoding = 1,
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5, ...
There must be a class that somehow determines the encoding of HTML for you. Is there a way to look into UIWebView and see how it does it?
It seems like downloading HTML should be easy. What am I missing?
Thanks!
Upvotes: 2
Views: 521
Reputation: 1088
Just going to round up my comments and add a few final words of advice as an answer.
For general usage, you can use ASIHTTPRequest's -responseString; otherwise you can take the data itself and use your own logic to figure out the encoding (UTF-8, UTF-16, etc.).
From the ASIHTTP website:
ASIHTTPRequest will attempt to read the text encoding of the received data from the Content-Type header. If it finds a text encoding, it will set responseEncoding to the appropriate NSStringEncoding. If it does not find a text encoding in the header, it will use the value of defaultResponseEncoding (this defaults to NSISOLatin1StringEncoding).
When you call [request responseString], ASIHTTPRequest will attempt to create a string from the data it received, using responseEncoding as the source encoding.
See also: Encoding issue with ASIHttpRequest
I would personally recommend taking the response data and just assuming the content is UTF-8 (or UTF-16). Of course you could also use a regular expression or an HTML parser to grab the <meta> tag inside the <head> element, but if the response is in an unusual encoding you might not be able to find the string @"<head" in the first place.
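As a rough sketch of the <meta> approach, something like the following could work. It relies on the Core Foundation calls CFStringConvertIANACharSetNameToEncoding and CFStringConvertEncodingToNSStringEncoding to turn an IANA charset name into an NSStringEncoding. The helper names EncodingFromIANAName and SniffMetaCharset, the 1024-byte window, and the regular expression are my own assumptions for illustration, not part of ASIHTTPRequest. It assumes ARC and an ASCII-compatible document, so it will not find the tag in UTF-16 content.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: map an IANA charset name (e.g. "UTF-8") to an
// NSStringEncoding. Returns 0 when Core Foundation does not know the name.
static NSStringEncoding EncodingFromIANAName(NSString *name) {
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

// Hypothetical helper: look for "charset=..." near the start of the raw HTML.
// Returns 0 when no usable charset declaration is found.
static NSStringEncoding SniffMetaCharset(NSData *html) {
    // Only scan the beginning of the document (assumed window of 1024 bytes).
    NSUInteger length = MIN([html length], (NSUInteger)1024);
    NSData *prefix = [html subdataWithRange:NSMakeRange(0, length)];

    // Latin-1 maps every byte to a character, so ASCII markup stays readable
    // whatever the real (ASCII-compatible) encoding turns out to be.
    NSString *text = [[NSString alloc] initWithData:prefix
                                           encoding:NSISOLatin1StringEncoding];
    if (text == nil) {
        return 0;
    }

    NSRegularExpression *regex = [NSRegularExpression
        regularExpressionWithPattern:@"charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)"
                             options:NSRegularExpressionCaseInsensitive
                               error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text
                          options:0
                            range:NSMakeRange(0, [text length])];
    if (match == nil) {
        return 0;
    }
    NSString *name = [text substringWithRange:[match rangeAtIndex:1]];
    return EncodingFromIANAName(name);
}
```

If SniffMetaCharset returns a non-zero value, you could pass it to -[NSString initWithData:encoding:] along with the full response data; a zero result means you still need a fallback.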
I would also use curl from the command line on your computer to see what Content-Type header the server is sending to ASIHTTPRequest. If you run a command like
curl -I "http://www.google.com/"
you'll get a response like the following:
HTTP/1.1 200 OK
Date: Tue, 09 Aug 2011 20:05:00 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
It would appear that almost all sites send this header correctly, and when they don't, I think assuming UTF-8 is a good bet. Could you comment with a link to the site that was giving you the issue?
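Here is a minimal sketch of that "bet on UTF-8" fallback, assuming ARC. DecodeHTML is a hypothetical helper name of mine. It depends on the fact that -initWithData:encoding: returns nil when the bytes are not valid in the requested encoding, while Latin-1 accepts any byte sequence.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: decode raw response bytes, preferring UTF-8.
static NSString *DecodeHTML(NSData *data) {
    // Strict UTF-8 decoding returns nil if the bytes are not valid UTF-8.
    NSString *text = [[NSString alloc] initWithData:data
                                           encoding:NSUTF8StringEncoding];
    if (text != nil) {
        return text;
    }
    // Latin-1 maps every byte to a character, so use it as the last resort.
    return [[NSString alloc] initWithData:data
                                 encoding:NSISOLatin1StringEncoding];
}
```

With ASIHTTPRequest you would presumably call this on [request responseData] only when the Content-Type header carried no charset, and keep using -responseString otherwise.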
Upvotes: 1