anpami

Reputation: 888

Webscraping cyrillic letters - encoding issue with rvest

I am trying to scrape the Russian journal names at https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1, but I have run into an encoding issue.

Instead of showing Автоматика и телемеханика, R displays Àâòîìàòèêà è òåëåìåõàíèêà.

Using the first result of rvest::guess_encoding() does not help either. I also tried read_html(nauka_url, encoding="UTF-8"), but that fails with the error: "Input is not proper UTF-8, indicate encoding !"

Here is my code so far:

  nauka_url <- "https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1"

  nauka_encoding <- rvest::guess_encoding(nauka_url)

  nauka_page <- xml2::read_html(nauka_url, encoding=nauka_encoding[1,1])
  
  nauka_journals <- rvest::html_node(nauka_page, css='#wraps > div > div > div > div > div.block-themes-category.block-themes-category-elems')
  
  nauka_journal_names <- rvest::html_nodes(nauka_journals, css='.edition__title')
  nauka_journal_names <- rvest::html_text(nauka_journal_names)

How can I obtain the correct Cyrillic letters? Thank you for your help!

Upvotes: 1

Views: 305

Answers (2)

QHarr

Reputation: 84465

Rather than guess, first inspect the response headers for the charset:


library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)

headers <- httr::GET('https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1') %>% 
  httr::headers() %>% 
  .$`content-type`

print(str_match(headers, 'charset=(.*)')[1,2])
#> [1] "windows-1251"

Created on 2021-01-02 by the reprex package (v0.3.0)
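Once the header reports the charset, it can be passed to read_html() explicitly. The following is only a minimal sketch: read_titles is a hypothetical helper name, the .edition__title selector is taken from the question, and it assumes the server still reports windows-1251. The sample page is built in memory so the snippet runs without network access.

```r
library(xml2)
library(rvest)

# Hypothetical helper (not part of rvest): parse HTML given as a URL, a file
# path, or raw bytes with an explicit encoding, then return the text of the
# nodes matching `css`.
read_titles <- function(x, encoding = "windows-1251", css = ".edition__title") {
  page <- xml2::read_html(x, encoding = encoding)
  rvest::html_text(rvest::html_nodes(page, css), trim = TRUE)
}

# Offline check: a tiny page converted to windows-1251 bytes, mimicking
# what the server sends.
sample_html <- "<html><body><div class='edition__title'>Автоматика и телемеханика</div></body></html>"
sample_raw <- iconv(sample_html, from = "UTF-8", to = "windows-1251", toRaw = TRUE)[[1]]

titles <- read_titles(sample_raw)
print(titles)

# Against the live page (requires network access; assumes the page layout
# from the question is unchanged):
# read_titles("https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1")
```

If the header and the page ever disagree, the encoding argument is the one place to change.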


Or query the page itself via the browser console; for example, evaluating document.characterSet there returns the encoding the browser applied to the page.


Or, indeed, check the charset declared in the page's meta tag (the meta element with a charset attribute) via the Elements tab of the browser's developer tools. This is not foolproof, since the declaration can be missing or wrong.
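The meta tag can also be checked from R without opening a browser. This is a rough sketch in base R: declared_charset is a hypothetical helper that searches the undecoded bytes for the first charset=... declaration, so it does not need to know the encoding in advance. It is a plain pattern match, not an HTML parser, and when an HTTP header charset is present browsers give that precedence over the meta tag.

```r
# Hypothetical helper: return the first charset declared in the HTML source,
# or NA if none is found. Works on raw bytes so nothing is decoded first.
declared_charset <- function(html_raw) {
  src <- rawToChar(html_raw)
  hit <- regmatches(
    src,
    regexpr("charset\\s*=\\s*[\"']?[A-Za-z0-9._-]+", src,
            ignore.case = TRUE, perl = TRUE, useBytes = TRUE)
  )
  if (length(hit) == 0) return(NA_character_)
  # Drop everything up to and including the "=" and an optional quote.
  tolower(sub(".*=\\s*[\"']?", "", hit, perl = TRUE))
}

# Both common forms of the declaration are handled.
short_form <- charToRaw("<html><head><meta charset=\"windows-1251\"></head></html>")
long_form <- charToRaw(
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Windows-1251\">"
)
no_meta <- charToRaw("<html><head></head><body></body></html>")

print(declared_charset(short_form))
print(declared_charset(long_form))
print(declared_charset(no_meta))

# For the live page, the raw bytes could be fetched with httr
# (requires network access):
# declared_charset(httr::content(httr::GET(url), as = "raw"))
```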

Upvotes: 3

Donald Seinen

Reputation: 4419

When encountering a foreign script, in this case Cyrillic, a trial-and-error procedure will often find the right encoding.

rvest::guess_encoding() does exactly that: it makes a guess based on confidence scores. However, it sometimes fails to identify the encoding, in which case manual trial and error can solve the issue. The documentation for stri_enc_detect in the stringi package (?stri_enc_detect) lists the widely used encodings for specific languages. For Cyrillic, try setting the encoding to "ISO-8859-5", "windows-1251" or "KOI8-R".
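The trial-and-error step can be sketched in base R by decoding the same bytes under each candidate and checking which result is readable. The sample bytes below are built in memory as an assumption about what the server sends (windows-1251); with a real page, the raw response body would take their place.

```r
# Bytes as a windows-1251 server would send them (built here for an offline demo).
original <- "Автоматика и телемеханика"
bytes <- iconv(original, from = "UTF-8", to = "windows-1251", toRaw = TRUE)[[1]]

# Candidate encodings for Cyrillic, as listed above.
candidates <- c("ISO-8859-5", "windows-1251", "KOI8-R")

# Decode the same bytes under each candidate; only the correct one yields
# readable Russian, the others produce scrambled Cyrillic.
decoded <- vapply(
  candidates,
  function(enc) iconv(list(bytes), from = enc, to = "UTF-8"),
  character(1)
)
print(decoded)
```

Whichever candidate gives readable text is the one to pass as the encoding argument of read_html().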

Upvotes: 1
