Reputation: 7143
I'm crawling some reading material from the web and saving it as UTF-8 .txt files:
const result = await page.evaluate(() => {
  const title = document.querySelector('#chapter-title').innerText;
  const content = document.querySelector('#chapter-content').innerText;
  return title + "\n\n" + content;
});
fs.writeFileSync(`./results/chapter${chapter}.txt`, result, 'utf8');
But some characters (mostly accented ones) differ in their original (HTML) form from how they appear in the browser, and they mess up my reading app.
Following is a screenshot of the same text: the first line is the result of crawling, and the second line comes from opening the page in a browser and selecting + copying the text manually:
It seems the browser somehow intelligently "fixed" the text and changed it into characters available in the font.
Since I don't know exactly what happened, my searches couldn't yield any results.
What happened, and is there any way I can convert the crawled text into the readable form?
Upvotes: 0
Views: 815
Reputation: 7143
I have resolved the issue using String.prototype.normalize().
The characters in the source HTML were in a mix of NFC and NFD forms. It seems my text editors failed to combine characters carrying two or more accents, rendering them as separate accents/squares. Calling normalize(), well, normalized them all to NFC, solving the issue.
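To illustrate the difference (a minimal sketch, not the original crawler code): in NFD, an accented letter is stored as a base letter followed by combining marks, while NFC stores it as a single precomposed code point. The two look identical in a browser but are different strings, and normalize('NFC') unifies them:

```javascript
// "é" in NFD form: base "e" + combining acute accent (U+0301)
const decomposed = 'e\u0301';
// "é" in NFC form: single precomposed code point (U+00E9)
const composed = '\u00E9';

console.log(decomposed === composed);                   // false: different code points
console.log(decomposed.normalize('NFC') === composed);  // true after normalization
console.log(decomposed.length, composed.length);        // 2 1
```

Applied to the question's code, the fix is to normalize before writing, e.g. `fs.writeFileSync(path, result.normalize('NFC'), 'utf8')`.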
(A self-answered question cannot be accepted within 2 days; feel free to elaborate or add references/comments on the issue as you see fit.)
Upvotes: 1