Web scraping Cyrillic letters - encoding issue with rvest

Question

I am trying to scrape the Russian journal names at https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1, but I am running into encoding problems.

Instead of showing Автоматика и телемеханика, R displays Àâòîìàòèêà è òåëåìåõàíèêà.

Using the first result of rvest::guess_encoding() does not work either. I also tried read_html(nauka_url, encoding="UTF-8"), but that failed with the error: "Input is not proper UTF-8, indicate encoding !"

Here is my code so far:

  nauka_url <- "https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1"

  nauka_encoding <- rvest::guess_encoding(nauka_url)

  nauka_page <- xml2::read_html(nauka_url, encoding=nauka_encoding[1,1])
  
  nauka_journals <- rvest::html_node(nauka_page, css='#wraps > div > div > div > div > div.block-themes-category.block-themes-category-elems')
  
  nauka_journal_names <- rvest::html_nodes(nauka_journals, css='.edition__title')
  nauka_journal_names <- rvest::html_text(nauka_journal_names)

How can I obtain the correct Cyrillic letters? Thank you for your help!

Donald Seinen · Accepted Answer

When a page in a non-Latin script, in this case Cyrillic, is decoded incorrectly, trial and error over a few candidate encodings will often find the right one.

rvest::guess_encoding() does exactly what its name says: it makes a guess, ranked by confidence scores. That guess sometimes fails to identify the correct encoding, in which case manual trial and error can solve the issue. The documentation at ?stri_enc_detect in the stringi package lists the widely used encodings for specific languages. For Cyrillic, try setting the encoding to "ISO-8859-5", "windows-1251" or "KOI8-R".
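As a rough illustration of that trial-and-error loop, here is a minimal offline sketch. It does not touch the real site: it builds a small HTML snippet, encodes it as windows-1251 bytes (an assumption made for the demo), and parses those bytes with each candidate encoding. The names title_utf8, html_bytes, candidates and results are made up for this example. The garbled output in the question is what windows-1251 bytes look like when read as Latin-1, so that candidate is the likely match, though only a run against the live page would confirm it.

```r
library(xml2)
library(rvest)

# "Автоматика", written with \u escapes so the script does not depend on
# the encoding of the source file itself.
title_utf8 <- "\u0410\u0432\u0442\u043e\u043c\u0430\u0442\u0438\u043a\u0430"

# Demo page, stored as windows-1251 bytes (assumed encoding for this sketch).
html_utf8 <- paste0(
  "<html><body><p class=\"edition__title\">", title_utf8, "</p></body></html>"
)
html_bytes <- iconv(html_utf8, from = "UTF-8", to = "windows-1251", toRaw = TRUE)[[1]]

# Candidate encodings suggested above for Cyrillic text.
candidates <- c("ISO-8859-5", "windows-1251", "KOI8-R")

# Parse the same bytes once per candidate and extract the title text.
results <- vapply(candidates, function(enc) {
  page <- xml2::read_html(html_bytes, encoding = enc)
  rvest::html_text(rvest::html_node(page, css = ".edition__title"))
}, character(1))

# Inspect the results and keep the encoding whose output is readable.
print(results)
```

For the real page, the same loop should work with nauka_url in place of html_bytes, for example xml2::read_html(nauka_url, encoding = "windows-1251"), followed by the html_nodes() and html_text() calls from the question.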
