awllower

Reputation: 571

Why can Haskell not handle characters from a specific website?

I was wondering whether I could write a Haskell program to check for updates to some novels on demand; the website I am using as an example is this one. I ran into a problem displaying its contents (on a Mac running El Capitan). The code is simple:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

display :: String -> IO ()
display = (>>= putStrLn) . openURL

Then, when I run display "http://www.piaotian.net/html/7/7430/" in GHCi, strange characters appear; the first lines look like this:

<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄѧÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏÉ°æȨÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">

I also tried to download as a file as follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

downloading :: FilePath -> String -> IO ()
downloading fileName = (>>= writeFile fileName) . openURL

But the downloaded file showed the same garbled characters. [screenshot of the downloaded file]

If I download the page with Python (using urllib, for example), the characters display normally. Also, if I write a Chinese HTML file myself and parse it, there seems to be no problem. So the problem seems specific to this website; however, I don't see any difference between the site's characters and the ones I write.

Any help on the reason behind this is much appreciated.

P.S.
The python code is as follows:

import urllib

theFic = file_path

urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)

And the file is all fine and good.

Upvotes: 15

Views: 1164

Answers (5)

Michael

Reputation: 2909

If you just want to download the file, e.g. to look at later, you only need the ByteString interface. It would be better to use http-client for this (or wreq, if you have some lens knowledge). Then you can open the file in your browser, which will see that it is a gbk file. So far you would just be transferring the raw bytes as a lazy ByteString; if I understand it, that's all the Python is doing. Encodings are not an issue at this stage; the browser is handling them.
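As a minimal sketch of that raw-bytes approach (assuming http-client is installed; the output file name chapter-list.html is just an illustrative choice):

```haskell
-- Sketch: fetch the page as a lazy ByteString and write the bytes to disk
-- untouched, exactly as a browser (or the Python urlretrieve call) would
-- receive them. No decoding happens anywhere in this program.
import Network.HTTP.Client
import qualified Data.ByteString.Lazy as BL

pageUrl :: String
pageUrl = "http://www.piaotian.net/html/7/7430/"

main :: IO ()
main = do
  manager  <- newManager defaultManagerSettings
  request  <- parseRequest pageUrl
  response <- httpLbs request manager
  -- The GBK bytes pass through verbatim, so the charset declared
  -- inside the file stays truthful.
  BL.writeFile "chapter-list.html" (responseBody response)
```

Opening chapter-list.html in a browser should then render correctly, because both the bytes and the charset=gbk declaration inside them are untouched.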

But if you want to view the characters inside ghci, for example, the main problem is that nothing will handle the gbk encoding by default the way the browser can. For that you need something like text-icu and the underlying C libraries. The program below uses the http-client library together with text-icu; these are, I think, pretty much standard for this problem, though the less powerful encoding library would also cover as much of the problem as we have seen so far. It seems to work okay:

import Network.HTTP.Client                     -- http-client
import Network.HTTP.Types.Status (statusCode)
import qualified Data.Text.Encoding as T       -- text
import qualified Data.Text.IO as T
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as ICU  -- text-icu
import qualified Data.Text.ICU as ICU
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  request <- parseRequest "http://www.piaotian.net/html/7/7430/"
  response <- httpLbs request manager
  gbk <- ICU.open "gbk" Nothing
  let txt :: T.Text
      txt = ICU.toUnicode gbk $ BL.toStrict $ responseBody response
  T.putStrLn txt

Here txt is a Text value, i.e. basically just the 'code points'. The last line, T.putStrLn txt, will use the system encoding to present the text to you. You can also handle the encoding explicitly with the functions in Data.Text.Encoding, or with the more sophisticated material in text-icu. For example, if you want to save the text in the UTF-8 encoding, you would use T.encodeUtf8.
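For instance, a self-contained sketch of that UTF-8 save (using a short literal in place of the decoded page, so it runs without the network):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: re-encode a decoded Text value as UTF-8 and write it out.
-- The literal stands in for the txt produced by ICU.toUnicode above.
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as BS

main :: IO ()
main = do
  let txt = "走进修仙" :: T.Text          -- placeholder for the decoded page
  -- encodeUtf8 turns the code points back into bytes, this time UTF-8
  BS.writeFile "out-utf8.html" (T.encodeUtf8 txt)
```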

So in my ghci the output looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" " http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>走进修仙最新章节,走进修仙无弹窗全文阅读_飘天文学</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />

...

Does that look right? What I am seeing in my ghci goes via UTF-8, since that's my system encoding, but note that the file itself says it is gbk. If you then wanted to do some Text transformation and save the result as an HTML file, you would of course need to make sure the charset mentioned inside the file matches the encoding you used to make the bytestring you write.
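A hedged sketch of that last point: patch the declared charset before encoding, so the meta tag and the bytes agree. A plain textual replace is enough for this one tag; fixCharset is a name I'm introducing for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: after decoding from GBK, rewrite the charset declaration so it
-- matches the UTF-8 bytes we are about to write out.
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as BS

fixCharset :: T.Text -> T.Text
fixCharset = T.replace "charset=gbk" "charset=utf-8"

main :: IO ()
main = do
  let page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gbk\" />"
  BS.writeFile "fixed.html" (T.encodeUtf8 (fixCharset page))
```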

You could also of course get this into the shape of a Haskell String by replacing the last three lines with

let str :: String
    str = T.unpack $ ICU.toUnicode gbk $ BL.toStrict $ responseBody response
putStrLn str

Upvotes: 6

ErikR

Reputation: 52057

Since you said you are interested in just the links, there is no need to convert the GBK encoding to Unicode.

Here is a version which prints out all links like "123456.html" in the document:

#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens
  --package tagsoup
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import Text.HTML.TagSoup
import Data.Char
import Control.Monad

-- match \d+\.html
isNumberHtml lbs = (LBS.dropWhile isDigit lbs) == ".html"

wanted t = isTagOpenName "a" t && isNumberHtml (fromAttrib "href" t)

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      tags = parseTags body
      links = filter wanted tags
      hrefs = map (fromAttrib "href") links
  forM_ hrefs LBS.putStrLn

Upvotes: 1

ErikR

Reputation: 52057

Here is an updated answer which uses the encoding package to convert the GBK encoded contents to Unicode.

#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens --package encoding --package binary
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E
import Data.Binary.Get

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      foo = runGet (E.decode E.GB18030) body 
  putStrLn foo

Upvotes: 6

ErikR

Reputation: 52057

This program produces the same output as the curl command:

curl "http://www.piaotian.net/html/7/7430/"

Test with:

stack program > out.html
open out.html

(If not using stack, just install the wreq and lens packages and execute with runhaskell.)

#!/usr/bin/env stack
-- stack --resolver lts-6.0 --install-ghc runghc --package wreq --package lens --package bytestring
{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  LBS.putStr (r ^. responseBody)

Upvotes: 1

Carl

Reputation: 27023

I'm pretty sure that if you use Network.HTTP with the String type, it converts bytes to characters using your system encoding, which is, in general, wrong.

This is only one of several reasons I don't like Network.HTTP.

Your options:

  1. Use the Bytestring interface. It's more awkward for some reason. It'll also require you to decode the bytes to characters manually. Most sites give you an encoding in the response headers, but sometimes they lie. It's a giant mess, really.

  2. Use a different http fetching library. I don't think any of them remove the messiness of dealing with lying encodings, but at least they don't make it harder to avoid the incorrect system-encoding conversion. I'd look into wreq or http-client instead.
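For what option 1 looks like in practice, here is a sketch using Network.HTTP's ByteString instance via mkRequest, so no String conversion ever happens; fetchBytes is an illustrative name, and decoding the GBK bytes remains your explicit next step.

```haskell
-- Sketch: request raw bytes from Network.HTTP instead of a String.
-- mkRequest applied to a parsed URI yields a Request ByteString, which
-- bypasses the system-encoding conversion entirely.
import Network.HTTP
import Network.URI (parseURI)
import qualified Data.ByteString as BS

fetchBytes :: String -> IO BS.ByteString
fetchBytes url =
  case parseURI url of
    Nothing  -> error ("bad URL: " ++ url)
    Just uri -> simpleHTTP (mkRequest GET uri) >>= getResponseBody

main :: IO ()
main = do
  bytes <- fetchBytes "http://www.piaotian.net/html/7/7430/"
  BS.writeFile "raw.html" bytes   -- bytes are still GBK; decode separately
```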

Upvotes: 8
