Reputation: 35
While web scraping, some of the text retrieved came back broken, much like the mojibake you get when the wrong encoding is used. The problem is that the encoding seems to be correct: "UTF-8". Is there any way to fix the text, even though it is supposedly already in the correct format? The chunk of code below reproduces the problem. RStudio is configured with "UTF-8" encoding, and functions that change the encoding always return even more gibberish. Thank you all in advance.
library(rvest)
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
title.news <- read_html(url) %>%
  html_nodes('body') %>%
  html_nodes('main') %>%
  html_nodes('article') %>%
  html_nodes('.block') %>%
  html_nodes('h1') %>%
  html_text()
title.news <- trimws(gsub(pattern = '\\s+', ' ', title.news))
Encoding(title.news)
[1] "UTF-8"
title.news
[1] "Folhas da Manhã, da Tarde e da Noite se uniram sob um só tÃtulo, Folha de S.Paulo, há 60 anos"
#Desired Output: Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos
Upvotes: 2
Views: 654
Reputation: 579
Inspired by GBLucass's answer:
library(rvest)
library(readr)
html.src <- read_file(url, locale = locale(encoding = "UTF-8"))
html.parse <- read_html(html.src)
This first reads the page into a string with the correct encoding, then parses the HTML from that string rather than from the URL. It seems to circumvent the problem.
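The same string-first pattern can be seen on a small self-contained example; the toy HTML below is made up for illustration and stands in for the downloaded source string:

```r
library(rvest)

# A toy page standing in for the string returned by read_file()
html.src <- "<html><body><main><article>
  <div class='block'><h1>  Um   t\u00edtulo de exemplo </h1></div>
</article></main></body></html>"

# Parse the string, not the URL, so the encoding of the
# in-memory string is exactly what the parser sees
title <- read_html(html.src) %>%
  html_nodes('article .block h1') %>%
  html_text()

trimws(gsub('\\s+', ' ', title))
#> [1] "Um título de exemplo"
```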
Upvotes: 1
Reputation: 35
Thank you all for your help! The following chunk solved the problem:
library(rvest)
library(dplyr)
library(httr)
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
pagina.web <- iconv(readLines(url, encoding = 'UTF-8'), 'UTF-8', 'UTF-8', sub = '')
titulo.noticia <- read_html(paste0(pagina.web, collapse = '\n')) %>%
  html_nodes('body') %>%
  html_nodes('main') %>%
  html_nodes('article') %>%
  html_nodes('.block') %>%
  html_nodes('h1') %>%
  html_text()
titulo.noticia
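For reference, the iconv(x, 'UTF-8', 'UTF-8', sub = '') call is what cleans the input: converting from UTF-8 to UTF-8 leaves valid text untouched, and sub = '' replaces any bytes that are not valid UTF-8 with the empty string, i.e. drops them. A small illustration (the string here is made up; "\xe9" is a lone Latin-1 byte that is invalid as UTF-8):

```r
# "\xe9" is the Latin-1 encoding of "e with acute accent";
# on its own it is not a valid UTF-8 sequence
x <- "caf\xe9 com leite"

# Round-tripping through iconv with sub = "" drops the invalid byte
iconv(x, "UTF-8", "UTF-8", sub = "")
#> [1] "caf com leite"
```

This is lossy (the accented character disappears), which is fine here because the invalid bytes only occur in throwaway HTML comments.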
Upvotes: 1
Reputation: 1601
The answer is here but I honestly don't know why it works.
To start with, there are some mis-encoded lines, which you can check for with utf8::utf8_valid:
library(rvest)
#> Loading required package: xml2
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
lines <- readLines(url, warn = FALSE)
lines[!utf8::utf8_valid(lines)]
#> [1] " Esse trecho est\xe1 em produ\xe7\xe3o para dar suporte aos componentes de chamadas,"
#> [2] " p\xe1ginas serem republicadas."
#> [3] " Trecho de c\xf3digo adicionado para renomear legenda abaixo das publicdades,"
These are comments in the page's source HTML. Stripping them out makes the functions work as expected:
lines <- readLines(url, warn = FALSE)
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('body') %>%
html_nodes('main') %>%
html_nodes('article') %>%
html_nodes('.block') %>%
html_nodes('h1') %>%
html_text() %>%
{trimws(gsub(pattern = '\\s+', ' ', .))}
#> [1] "Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos"
Created on 2020-01-13 by the reprex package (v0.3.0)
Upvotes: 1