MakoBuk

Reputation: 464

Node.js convert string from ISO-8859-2 to UTF-8

When I download page content with the Node.js request library and the content is encoded in ISO-8859-2, I cannot convert it to UTF-8.

I am using node-iconv for it.

Code:

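const request = require('request');
const { Iconv } = require('iconv');

request('https://www.jakpsatweb.cz', function(err, resp, body){
    // regexToRetrieveTitle is my own helper (not shown here)
    const title = regexToRetrieveTitle(body);
    const iconv = new Iconv('ISO-8859-2', 'UTF-8');
    const buffer = iconv.convert(title);
    console.log(buffer);
    console.log(buffer.toString('UTF8'));
})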

Console:

<Buffer 52 65 6b 6c 61 6d 61 3a 20 6a 61 6b 20 66 75 6e 67 75 6a 65 20 77 65 62 6f 76 c4 8f c5 bc cb 9d 20 72 65 6b 6c 61 6d 61>
Reklama: jak funguje webovďż˝ reklama

Expected result:

Reklama: jak funguje webová reklama

Does anyone know where the problem is?

EDIT:

For example, I download this page. I identified the encoding as ISO-8859-2 from the meta tags (the Chrome browser detects it too), and I need to convert the page content and save it to a database. My database is UTF-8, so I need to convert the content.

Upvotes: 0

Views: 3383

Answers (2)

MakoBuk

Reputation: 464

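The problem is in how I called the Node.js request library. Its encoding option defaults to UTF-8, so the response body was decoded as UTF-8 before I could convert it. I had to set encoding to null, which returns the body as a raw Buffer, and now everything works fine.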

request({ uri: 'https://www.jakpsatweb.cz', encoding: null}, function(err, resp, body){
    .....
})
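
A minimal sketch of why the default matters, using only Node's built-in Buffer and no network access. The byte values are illustrative assumptions standing in for a fragment of the downloaded page (0xE1 is "á" in ISO-8859-2). Decoding them as UTF-8, which is what happens when encoding is left at its default, replaces the byte with U+FFFD before iconv ever sees it:

```javascript
// Stand-in for three bytes of the ISO-8859-2 page: "v", "á" (0xE1), " ".
const rawBytes = Buffer.from([0x76, 0xe1, 0x20]);

// Decoding as UTF-8 cannot interpret the lone 0xE1 and substitutes U+FFFD.
const decodedAsUtf8 = rawBytes.toString('utf8');

// Re-encoding shows the original byte is gone; EF BF BD is U+FFFD in UTF-8.
const damagedHex = Buffer.from(decodedAsUtf8, 'utf8').toString('hex');

console.log(JSON.stringify(decodedAsUtf8)); // string with U+FFFD in place of "á"
console.log(damagedHex);                    // 76efbfbd20
console.log(rawBytes.toString('hex'));      // 76e120 (original bytes, still intact)
```

With encoding: null, body arrives as a Buffer like rawBytes, so the 0xE1 byte is still intact when it is handed to iconv.convert.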

Upvotes: 2

Bruno Haible

Reputation: 1292

The conversion from ISO-8859-2 to UTF-8 worked fine. It was the input (the title variable) that had the wrong contents: the title contained the bytes EF BF BD. This means that the title was already UTF-8 encoded, but with a U+FFFD (REPLACEMENT CHARACTER) in the place where you would expect the letter á (LATIN SMALL LETTER A WITH ACUTE). Converting those three bytes as if they were ISO-8859-2 yields the three characters ď, ż and ˝, which are the bytes c4 8f c5 bc cb 9d in your output.
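
The byte trail can be reproduced with a short sketch. The three-entry lookup table is an assumption made for illustration: it covers only the bytes EF, BF and BD of ISO-8859-2, not the whole charset, and stands in for what iconv does when told that its input is ISO-8859-2.

```javascript
// Partial ISO-8859-2 table (illustration only): byte -> Unicode code point.
const latin2 = { 0xef: 0x010f, 0xbf: 0x017c, 0xbd: 0x02dd }; // ď, ż, ˝

// U+FFFD encoded as UTF-8 is the byte sequence EF BF BD.
const replacementUtf8 = Buffer.from('\ufffd', 'utf8');

// Misread each of those bytes as an ISO-8859-2 character.
const misread = String.fromCodePoint(...[...replacementUtf8].map((b) => latin2[b]));

// Encode the misread characters as UTF-8, as the converter would.
const reencodedHex = Buffer.from(misread, 'utf8').toString('hex');

console.log(replacementUtf8.toString('hex')); // efbfbd
console.log(reencodedHex);                    // c48fc5bccb9d
```

The second line of output matches the bytes c4 8f c5 bc cb 9d in the Buffer dump from the question, which confirms that the replacement character was already present before the conversion ran.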

Now, the original web page https://www.jakpsatweb.cz/reklama/index.html is correctly encoded in ISO-8859-2 and also has the required charset declaration in the <head> section.

Therefore the problem must be either in the software that downloads the web page (Node.js and the request library) or in the regexToRetrieveTitle function.

Upvotes: 1
