The Linux Kitten

Reputation: 157

How to parse an HTML page with accented words (Spanish) without losing them?

I'm reading an HTML web page that contains literal accented words (Spanish):

<head> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
<title>Web page</title>
<body>
<p>Título</p>
<p>Año</p>
<p>Ángel</p>
<p>¿por qué nos vamos?</p>
</body>

I'm using HXT:

...
let doc = readDocument [ withValidate no
                       , withInputEncoding iso8859_1
                       , withParseHTML yes
                       , withWarnings no
                       , withEncodingErrors no
                       , withCurl []] url
...

Using the option

withInputEncoding utf8

discards those characters, giving the following words: Ttulo, Ao, ngel, por qu nos vamos? Using the option

withInputEncoding iso8859_1

converts those characters into escape sequences, giving words like: Rom\225ntica, Man\180s, H\233ctor, where \225, \180 and \233 appear as literal text rather than as single characters.

What is the best way to handle this in HXT and get all the words back unmodified?

Thanks.

Upvotes: 2

Views: 155

Answers (1)

Yuras

Reputation: 13876

I bet you already have everything you need:

Prelude> putStrLn $ read "\"Rom\225ntica\""
Romántica

It looks like you are looking at the result of show applied to the string, not at the string itself. Note that print uses show:

Prelude> print (read "\"Rom\225ntica\"" :: String)
"Rom\225ntica"
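
To make the difference concrete outside GHCi, here is a minimal sketch using only base. The names word and shown are made up for illustration, and the string literal stands in for whatever text HXT returns with withInputEncoding iso8859_1; I'm assuming that text is an ordinary String. The hSetEncoding call is only there so the output does not depend on the terminal's locale.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical stand-in for a word extracted from the parsed page.
-- '\225' is one Char (a with acute accent), written as an escape.
word :: String
word = "Rom\225ntica"

-- What print (and GHCi's result display) produces:
-- show escapes non-ASCII characters and adds quotes.
shown :: String
shown = show word

main :: IO ()
main = do
  hSetEncoding stdout utf8   -- avoid encoding errors under a non-UTF-8 locale
  print (length word)        -- 9: the escape denotes a single character
  putStrLn shown             -- the escaped, quoted form
  putStrLn word              -- the actual text, accent intact
```

So if the words are printed with putStrLn (or written with writeFile, hPutStr, and so on) instead of print or show, the accents should come out as expected.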

Upvotes: 6
