Reputation: 35
While web scraping, some of the text retrieved came back broken, much like the mojibake you get when the wrong encoding is used. The problem is that the encoding seems to be correct: "UTF-8". Is there any way to fix the text, even though it is supposedly already in the correct format? The chunk of code below reproduces the problem. RStudio is configured with "UTF-8" encoding, and functions that change the encoding always return even more gibberish. Thank you all in advance.
library(rvest)
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
title.news <- read_html(url) %>%
  html_nodes('body') %>%
  html_nodes('main') %>%
  html_nodes('article') %>%
  html_nodes('.block') %>%
  html_nodes('h1') %>%
  html_text()
title.news <- trimws(gsub(pattern = '\\s+', ' ', title.news))
Encoding(title.news)
[1] "UTF-8"
title.news
[1] "Folhas da Manhã, da Tarde e da Noite se uniram sob um só tÃtulo, Folha de S.Paulo, há 60 anos"
#Desired Output: Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos
Upvotes: 2
Views: 654
Reputation: 579
Inspired by GBLucass's answer:
library(rvest)
library(readr)
html.src <- read_file(url, locale = locale(encoding = "UTF-8"))
html.parse <- read_html(html.src)
This first reads the page into a string with the correct encoding, then parses the HTML from that string rather than from the URL. It seems to circumvent the problem.
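The same string-first pattern can be seen on a small self-contained example; the toy HTML below is made up for illustration and stands in for the downloaded source string:

```r
library(rvest)

# A toy page standing in for the string returned by read_file()
html.src <- "<html><body><main><article>
  <div class='block'><h1>  Um   t\u00edtulo de exemplo </h1></div>
</article></main></body></html>"

# Parse the string, not the URL, so the encoding of the
# in-memory string is exactly what the parser sees
title <- read_html(html.src) %>%
  html_nodes('article .block h1') %>%
  html_text()

trimws(gsub('\\s+', ' ', title))
#> [1] "Um título de exemplo"
```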
Upvotes: 1
Reputation: 35
Thank you all for your help! The following chunk solved the problem:
library(rvest)
library(dplyr)
library(httr)
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
pagina.web <- iconv(readLines(url, encoding = 'UTF-8'), 'UTF-8', 'UTF-8', sub = '')
titulo.noticia <- read_html(paste0(pagina.web, collapse = '\n')) %>%
  html_nodes('body') %>%
  html_nodes('main') %>%
  html_nodes('article') %>%
  html_nodes('.block') %>%
  html_nodes('h1') %>%
  html_text()
titulo.noticia
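For reference, the iconv(x, 'UTF-8', 'UTF-8', sub = '') call is what cleans the input: converting from UTF-8 to UTF-8 leaves valid text untouched, and sub = '' replaces any bytes that are not valid UTF-8 with the empty string, i.e. drops them. A small illustration (the string here is made up; "\xe9" is a lone Latin-1 byte that is invalid as UTF-8):

```r
# "\xe9" is the Latin-1 encoding of "e with acute accent";
# on its own it is not a valid UTF-8 sequence
x <- "caf\xe9 com leite"

# Round-tripping through iconv with sub = "" drops the invalid byte
iconv(x, "UTF-8", "UTF-8", sub = "")
#> [1] "caf com leite"
```

This is lossy (the accented character disappears), which is fine here because the invalid bytes only occur in throwaway HTML comments.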
Upvotes: 1
Reputation: 1601
The answer is here but I honestly don't know why it works.
To start with, there are some mis-encoded lines, which you can check for with utf8::utf8_valid:
library(rvest)
#> Loading required package: xml2
url <- "https://www1.folha.uol.com.br/poder/2020/01/folhas-da-manha-da-tarde-e-da-noite-se-uniram-sob-um-so-titulo-folha-de-spaulo-ha-60-anos.shtml"
lines <- readLines(url, warn = FALSE)
lines[!utf8::utf8_valid(lines)]
#> [1] " Esse trecho est\xe1 em produ\xe7\xe3o para dar suporte aos componentes de chamadas,"
#> [2] " p\xe1ginas serem republicadas."
#> [3] " Trecho de c\xf3digo adicionado para renomear legenda abaixo das publicdades,"
These are comments in the page's source HTML. Stripping them out makes the functions work as expected:
lines <- readLines(url, warn = FALSE)
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('body') %>%
html_nodes('main') %>%
html_nodes('article') %>%
html_nodes('.block') %>%
html_nodes('h1') %>%
html_text() %>%
{trimws(gsub(pattern = '\\s+', ' ', .))}
#> [1] "Folhas da Manhã, da Tarde e da Noite se uniram sob um só título, Folha de S.Paulo, há 60 anos"
Created on 2020-01-13 by the reprex package (v0.3.0)
Upvotes: 1