How to parse an HTML page with accented words (Spanish) without losing the accents?

Question

I'm reading an HTML web page that contains literal accented words (Spanish):

 
 
Web page

Título
Año
Ángel
¿por qué nos vamos?

I'm using HXT:

...
let doc = readDocument [ withValidate no
                       , withInputEncoding iso8859_1
                       , withParseHTML yes
                       , withWarnings no
                       , withEncodingErrors no
                       , withCurl []] url
...

With the option

withInputEncoding utf8

the accented characters are discarded, and I get words like: Ttulo, Ao, ngel, por qu nos vamos? With the option

withInputEncoding iso8859_1

the accented characters are turned into escape sequences, and I get words like: Rom\225ntica, Man\180s, H\233ctor, where \225, \180 and \233 appear as strings rather than as single characters.
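One thing that may be worth checking here is how the result is printed. The sketch below is only an illustration and assumes the words are being displayed with print or show, which the snippet above does not confirm. In that case an escape like \225 would not be a sign of data loss: show renders every non-ASCII Char as a decimal escape, while putStrLn writes the Char itself. The name sample is hypothetical and simply reuses one of the words from the output above.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical sample: the word shown above as Rom\225ntica.
-- The literal escape \225 denotes the single Char U+00E1.
sample :: String
sample = "Rom\225ntica"

main :: IO ()
main = do
  -- Encode output as UTF-8 regardless of the current locale.
  hSetEncoding stdout utf8
  -- print goes through show, which escapes every non-ASCII Char.
  print sample
  -- The escape stands for one Char, so the word has 9 Chars.
  print (length sample)
  -- putStrLn writes the Chars themselves, without escaping.
  putStrLn sample
```

If this assumption holds, the iso8859_1 result would already contain the right characters, and only the way they are displayed would need to change.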

What is the best way to handle this in HXT so that all the words come through unmodified?

Thanks.
