itgiawa

Reputation: 1616

Encoding not present in HTTP header, how to find it in HTML header? (iPhone)

I'm writing a browser for the iPhone.

I'm using

NSString* storyHTML = @"";
ASIHTTPRequest *request = [ASIHTTPRequest requestWithURL:url];
[request startSynchronous];

to download HTML. The problem is that sometimes the HTTP header specifies no encoding, in which case the above code defaults to ISO Latin-1 (NSISOLatin1StringEncoding).

In this case I can read as far as the HTML header and find the meta tag that specifies the actual encoding, which looks something like this:

<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />

The problem is that a huge number of encodings can appear in the meta tag, as seen here: http://www.iana.org/assignments/character-sets

I would need to somehow convert one of those encoding names into one of the encoding constants defined for NSString:

 enum {
   NSASCIIStringEncoding = 1,
   NSNEXTSTEPStringEncoding = 2,
   NSJapaneseEUCStringEncoding = 3,
   NSUTF8StringEncoding = 4,
   NSISOLatin1StringEncoding = 5, ...

There must be a class that somehow determines the encoding of HTML for you. Is there a way to look into UIWebView and see how it does this?

It seems like downloading HTML should be easy. What am I missing?

Thanks!

Upvotes: 2

Views: 521

Answers (2)

smdvlpr

Reputation: 1088

I'm going to round up my comments into an answer and add a few final words of advice.


Comment 1:

For general usage, you can use ASIHTTPRequest's -responseString. Otherwise you can take the raw data and use your own logic to work out which encoding it is in (UTF-8, UTF-16, etc.).


Comment 2:

From the ASIHTTP website:

ASIHTTPRequest will attempt to read the text encoding of the received data from the Content-Type header. If it finds a text encoding, it will set responseEncoding to the appropriate NSStringEncoding. If it does not find a text encoding in the header, it will use the value of defaultResponseEncoding (this defaults to NSISOLatin1StringEncoding).

When you call [request responseString], ASIHTTPRequest will attempt to create a string from the data it received, using responseEncoding as the source encoding.


Comment 3:

See also: Encoding issue with ASIHttpRequest


I would personally recommend taking the response data and just assuming the content can be decoded as UTF-16 (or UTF-8). Of course you could also use a regular expression or an HTML parser to grab the <meta> tag inside the <head> element, but if the response is in an unusual encoding you might not be able to find the string @"<head" at all.
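As a rough sketch of the regular-expression route, the helper below scans the first few kilobytes of the response for a charset= token and maps the name to an NSStringEncoding. The function name SniffMetaCharset, the 4096-byte limit, and the pattern are my own choices for illustration; the only real APIs it leans on are NSRegularExpression (iOS 4 and later) and Core Foundation's CFStringConvertIANACharSetNameToEncoding and CFStringConvertEncodingToNSStringEncoding. It assumes ARC (drop the __bridge cast otherwise), and it will not find anything in a UTF-16 document, for the reason given above.

```objectivec
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the NSStringEncoding named by the first
// charset=... token near the top of the HTML, or 0 if no token is found
// or Core Foundation does not recognise the name.
static NSStringEncoding SniffMetaCharset(NSData *htmlData) {
    // Only scan a prefix; the <meta> tag normally sits near the top.
    NSUInteger prefixLength = MIN((NSUInteger)4096, [htmlData length]);
    NSData *prefix = [htmlData subdataWithRange:NSMakeRange(0, prefixLength)];

    // ISO Latin-1 assigns a character to every byte value, so ASCII markup
    // stays readable even when the real encoding is something else.
    NSString *head = [[NSString alloc] initWithData:prefix
                                           encoding:NSISOLatin1StringEncoding];
    if (head == nil) {
        return 0;
    }

    // Capture the charset name, with or without an opening quote before it.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:head options:0 range:NSMakeRange(0, [head length])];
    if (match == nil) {
        return 0;
    }

    // Let Core Foundation translate the IANA name instead of hand-writing a table.
    NSString *name = [head substringWithRange:[match rangeAtIndex:1]];
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

If this returns a non-zero value, you could pass it to -[NSString initWithData:encoding:] together with [request responseData]; if it returns 0, fall back to whatever default you prefer.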

I would also use curl from the command line on your computer to see what Content-Type headers the sites ASIHTTPRequest is fetching actually send. If you run a command like

curl -I "http://www.google.com/"

you'll get a response like the following:

HTTP/1.1 200 OK
Date: Tue, 09 Aug 2011 20:05:00 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1

It would appear that almost all sites respond correctly with this header, and when they don't, I think assuming UTF-8 would be a good bet. Could you comment with a link to the site that was giving you the issue?
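One way to hedge the UTF-8 bet, sketched under the assumption that ARC is enabled: try UTF-8 first and fall back to ISO Latin-1 only when the bytes are not valid UTF-8. The function name DecodeHTMLWithFallback is made up for this example; it relies on -initWithData:encoding: returning nil when the data cannot be decoded in the requested encoding.

```objectivec
#import <Foundation/Foundation.h>

// Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
// otherwise fall back to ISO Latin-1, which accepts any byte sequence.
static NSString *DecodeHTMLWithFallback(NSData *data) {
    NSString *text = [[NSString alloc] initWithData:data
                                           encoding:NSUTF8StringEncoding];
    if (text == nil) {
        // Invalid UTF-8: treat each byte as one Latin-1 character instead.
        text = [[NSString alloc] initWithData:data
                                     encoding:NSISOLatin1StringEncoding];
    }
    return text;
}
```

With ASIHTTPRequest you would call this on [request responseData] instead of relying on [request responseString] when no charset was present in the header.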

Upvotes: 1

cduhn

Reputation: 17918

Is there a way to look into UIWebView and see how they do it?

There is. UIWebView is a wrapper around WebKit, which is an open source project. You can check out the source code or browse it online.

Upvotes: 0
