Reputation: 888
I am trying to scrape the Russian journal names at https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1, but I have issues with the encoding.
Instead of showing "Автоматика и телемеханика", R displays "Àâòîìàòèêà è òåëåìåõàíèêà".
Even using the first result of rvest::guess_encoding() does not work. I also tried read_html(nauka_url, encoding="UTF-8"), but received the error "Input is not proper UTF-8, indicate encoding !"
Here is my code so far:
nauka_url <- "https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1"
nauka_encoding <- rvest::guess_encoding(nauka_url)
nauka_page <- xml2::read_html(nauka_url, encoding=nauka_encoding[1,1])
nauka_journals <- rvest::html_node(nauka_page, css='#wraps > div > div > div > div > div.block-themes-category.block-themes-category-elems')
nauka_journal_names <- rvest::html_nodes(nauka_journals, css='.edition__title')
nauka_journal_names <- rvest::html_text(nauka_journal_names)
How can I obtain the correct Cyrillic letters? Thank you for your help!
Upvotes: 1
Views: 305
Reputation: 84465
Rather than guessing, first inspect the response headers for the charset:
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
headers <- httr::GET('https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1') %>%
  httr::headers() %>%
  .$`content-type`
print(str_match(headers, 'charset=(.*)')[1,2])
#> [1] "windows-1251"
Created on 2021-01-02 by the reprex package (v0.3.0)
Or query the page itself via the browser console.
Or, indeed, check the declaration in meta[charset], i.e. the meta tag with a charset attribute (not foolproof), via the Elements tab of the browser's developer tools.
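Once the charset is known, it still has to be handed to the parser. The following is a minimal sketch of one way to do that, not a definitive implementation: extract_charset() and read_html_declared() are hypothetical helper names introduced here for illustration, and the "UTF-8" fallback is an assumption for servers that declare no charset.

```r
# Hypothetical helper: pull the charset out of a Content-Type header value.
# Returns NA when the header is missing or declares no charset.
extract_charset <- function(content_type) {
  if (is.null(content_type)) return(NA_character_)
  m <- regmatches(
    content_type,
    regexec("charset=([^;[:space:]]+)", content_type, ignore.case = TRUE)
  )[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical helper: download a page and parse it with the charset
# the server declares (falling back to an assumed default otherwise).
read_html_declared <- function(url, fallback = "UTF-8") {
  resp <- httr::GET(url)
  charset <- extract_charset(httr::headers(resp)$`content-type`)
  if (is.na(charset)) charset <- fallback
  # Parse the raw bytes so that no implicit re-encoding happens on the way
  xml2::read_html(httr::content(resp, as = "raw"), encoding = charset)
}

# Usage (needs network access, so it is left commented out):
# nauka_page <- read_html_declared("https://www.libnauka.ru/elektronnii-katalog/?PAGEN_1=1")
# rvest::html_text(rvest::html_nodes(nauka_page, ".edition__title"))
```

For the page in the question, passing encoding = "windows-1251" directly to xml2::read_html() should have the same effect, since that is the charset the header reports.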
Upvotes: 3
Reputation: 4419
When encountering a foreign script, in this case Cyrillic, a trial-and-error procedure will often find the right encoding.
rvest::guess_encoding does exactly that: it makes a guess based on confidence scores. However, it sometimes fails to identify the encoding, in which case manual trial and error can solve the issue.
The ?stri_enc_detect help page of the stringi package lists widely used encodings for specific languages. For Cyrillic, try setting the encoding to "ISO-8859-5", "windows-1251" or "KOI8-R".
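The trial-and-error approach can be sketched without touching the network. This is an illustration under stated assumptions: the bytes below are hard-coded to be the windows-1251 encoding of "Автоматика", standing in for bytes downloaded from the page, and it assumes the platform's iconv supports all three encoding names.

```r
# windows-1251 bytes for the word "Автоматика" (hard-coded for illustration)
raw_bytes <- as.raw(c(0xC0, 0xE2, 0xF2, 0xEE, 0xEC, 0xE0, 0xF2, 0xE8, 0xEA, 0xE0))

# Candidate encodings for Cyrillic text
candidates <- c("ISO-8859-5", "windows-1251", "KOI8-R")

# Decode the same bytes under each candidate; iconv() accepts a list of raw vectors
decoded <- vapply(
  candidates,
  function(enc) iconv(list(raw_bytes), from = enc, to = "UTF-8"),
  character(1)
)

# Inspect the results by eye: only one candidate yields a readable word
print(decoded)
```

All three candidates map these bytes to Cyrillic letters, so an automatic check for "is it Cyrillic?" would not be enough here; the readable result is what identifies the right encoding.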
Upvotes: 1