Reputation: 571
I was wondering if I can write a Haskell program to check for updates of some novels on demand, and the website I am using as an example is this. I ran into a problem displaying its contents (on a Mac, El Capitan). The simple code follows:
import Network.HTTP
openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest
display :: String -> IO ()
display = (>>= putStrLn) . openURL
Then, when I run display "http://www.piaotian.net/html/7/7430/" in GHCi, some strange characters appear; the first lines look like this:
<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄѧÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏÉ°æȨÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">
I also tried downloading it to a file as follows:
import Network.HTTP
openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest
downloading :: FilePath -> String -> IO ()
downloading fileName = (>>= writeFile fileName) . openURL
But the downloaded file is garbled in the same way.
If I download the page with Python (using urllib, for example), the characters display normally. Also, if I write a Chinese HTML file myself and parse it, there seems to be no problem. So the problem seems specific to this website; however, I don't see any difference between its characters and the ones I write.
Any help with the reason behind this is much appreciated.
P.S.
The python code is as follows:
import urllib
theFic = file_path
urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)
And the resulting file is all fine and good.
Upvotes: 15
Views: 1164
Reputation: 2909
If you just want to download the file, e.g. to look at later, you just need to use the ByteString interface. It would be better to use http-client for this (or wreq if you have some lens knowledge). Then you can open the file in your browser, which will see that it is a gbk file. So far you would just be transferring the raw bytes as a lazy ByteString; if I understand it, that's all the Python is doing. Encodings are not an issue here; the browser is handling them.
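For example, a minimal sketch of that raw download with http-client (the output file name is my own invention):
import Network.HTTP.Client
import qualified Data.ByteString.Lazy as BL

-- fetch the page and save the bytes untouched; the browser can then
-- apply the gbk encoding itself when it opens the file
main :: IO ()
main = do
  manager  <- newManager defaultManagerSettings
  request  <- parseRequest "http://www.piaotian.net/html/7/7430/"
  response <- httpLbs request manager
  BL.writeFile "piaotian.html" (responseBody response)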
But if you want to view the characters inside GHCi, for example, the main problem is that nothing will handle the gbk encoding by default the way the browser can. For that you need something like text-icu and the underlying C libraries. The program below uses the http-client library together with text-icu; these are, I think, pretty much standard for this problem, though you could use the less powerful encoding library for as much of the problem as we have seen so far. It seems to work okay:
import Network.HTTP.Client -- http-client
import Network.HTTP.Types.Status (statusCode)
import qualified Data.Text.Encoding as T -- text
import qualified Data.Text.IO as T
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as ICU -- text-icu
import qualified Data.Text.ICU as ICU
import qualified Data.ByteString.Lazy as BL
main :: IO ()
main = do
  manager  <- newManager defaultManagerSettings
  request  <- parseRequest "http://www.piaotian.net/html/7/7430/"
  response <- httpLbs request manager      -- fetch the raw bytes
  gbk      <- ICU.open "gbk" Nothing       -- set up a gbk converter
  let txt :: T.Text
      txt = ICU.toUnicode gbk $ BL.toStrict $ responseBody response
  T.putStrLn txt                           -- printed via the system encoding
Here txt is a Text value, i.e. basically just the 'code points'. The last bit, T.putStrLn txt, will use the system encoding to present the text to you. You can also handle the encoding explicitly with the functions in Data.Text.Encoding or the more sophisticated material in text-icu. For example, if you want to save the text in the utf8 encoding, you would use T.encodeUtf8.
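A minimal sketch of that, assuming the txt value from the program above (the helper and file name are my own):
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- write Text out to disk as utf8 bytes
saveUtf8 :: FilePath -> T.Text -> IO ()
saveUtf8 path = BS.writeFile path . T.encodeUtf8

-- usage: saveUtf8 "out.html" txt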
So in my GHCi the output looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>走进修仙最新章节,走进修仙无弹窗全文阅读_飘天文学</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
...
Does that look right? In my GHCi, what I am seeing goes via utf8, since that's my system encoding, but note that the file itself says it is a gbk file, of course. If, then, you wanted to do some Text transformation and save the result as an html file, you would need to make sure the charset mentioned inside the file matches the encoding you used to make the ByteString you write to the file.
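Here is a sketch of that last point; the crude textual charset rewrite is only my illustration, not part of the program above:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- rewrite the declared charset so it matches the utf8 bytes we emit
saveAsUtf8Html :: FilePath -> T.Text -> IO ()
saveAsUtf8Html path =
  BS.writeFile path . T.encodeUtf8 . T.replace "charset=gbk" "charset=utf-8"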
You could also, of course, get this into the shape of a Haskell String by replacing the last three lines with:
  let str :: String
      str = T.unpack $ ICU.toUnicode gbk $ BL.toStrict $ responseBody response
  putStrLn str
Upvotes: 6
Reputation: 52057
Since you said you are interested in just the links, there is no need to convert the GBK encoding to Unicode.
Here is a version which prints out all links like "123456.html" in the document:
#!/usr/bin/env stack
{- stack
--resolver lts-6.0 --install-ghc runghc
--package wreq --package lens
--package tagsoup
-}
{-# LANGUAGE OverloadedStrings #-}
import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import Text.HTML.TagSoup
import Data.Char
import Control.Monad
-- match hrefs of the form \d+\.html
isNumberHtml lbs = LBS.dropWhile isDigit lbs == ".html"

-- keep only the <a> tags whose href matches
wanted t = isTagOpenName "a" t && isNumberHtml (fromAttrib "href" t)

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body  = r ^. responseBody :: LBS.ByteString
      tags  = parseTags body
      links = filter wanted tags
      hrefs = map (fromAttrib "href") links
  forM_ hrefs LBS.putStrLn
Upvotes: 1
Reputation: 52057
Here is an updated answer which uses the encoding package to convert the GBK-encoded contents to Unicode.
#!/usr/bin/env stack
{- stack
--resolver lts-6.0 --install-ghc runghc
--package wreq --package lens --package encoding --package binary
-}
{-# LANGUAGE OverloadedStrings #-}
import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E
import Data.Binary.Get
main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      foo  = runGet (E.decode E.GB18030) body  -- decode the GBK bytes
  putStrLn foo
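The encoding package also exposes decodeLazyByteString, which, if I remember the API correctly, lets you skip the binary Get runner; a sketch of that variant (treat the function name as my assumption and double-check it):
import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
  -- decode the lazy ByteString directly, without Data.Binary.Get
  putStrLn (E.decodeLazyByteString E.GB18030 body)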
Upvotes: 6
Reputation: 52057
This program produces the same output as the curl command:
curl "http://www.piaotian.net/html/7/7430/"
Test with:
stack program > out.html
open out.html
(If not using stack, just install the wreq and lens packages and execute with runhaskell.)
#!/usr/bin/env stack
-- stack --resolver lts-6.0 --install-ghc runghc --package wreq --package lens --package bytestring
{-# LANGUAGE OverloadedStrings #-}
import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  LBS.putStr (r ^. responseBody)
Upvotes: 1
Reputation: 27023
I'm pretty sure that if you use Network.HTTP with the String type, it converts bytes to characters using your system encoding, which is, in general, wrong. This is only one of several reasons I don't like Network.HTTP.
Your options:

1. Use the ByteString interface. It's more awkward for some reason. It'll also require you to decode the bytes to characters manually. Most sites give you an encoding in the response headers, but sometimes they lie. It's a giant mess, really. (A sketch follows this list.)

2. Use a different HTTP-fetching library. I don't think any of them remove the messiness of dealing with lying encodings, but at least they don't make it more awkward to avoid using the system encoding incorrectly. I'd look into wreq or http-client instead.
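A minimal sketch of option 1, using Network.HTTP's ByteString interface via mkRequest (the helper name is mine; decoding the bytes is still up to you):
import Network.HTTP
import Network.URI (parseURI)
import qualified Data.ByteString as BS

-- fetch the body as raw bytes; turning them into Text/String is then
-- your job, with text-icu, encoding, or similar
openURLBytes :: String -> IO BS.ByteString
openURLBytes url =
  case parseURI url of
    Nothing  -> fail ("bad URI: " ++ url)
    Just uri -> simpleHTTP (mkRequest GET uri) >>= getResponseBody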
Upvotes: 8