Reputation: 69
I'm learning about web scraping and have a question about two packages, httr and RCurl. I'm trying to get a journal's ISSN from the ResearchGate website. When I fetch the page with RCurl, the extraction returns the ISSN, but when I fetch it with httr, the same extraction returns NULL. Can anyone tell me why? I expected both approaches to work. The code is below.
library(rvest)
library(httr)
library(RCurl)
library(stringr) # provides str_to_title() and str_split()
url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"
########
# httr #
########
conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status
webpage1 <- content(conexao, as = "text", encoding = "utf-8") %>% read_html()
ISSN <- webpage1 %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text() %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist()
ISSN
########
# RCurl #
########
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
                            ssl.verifypeer = FALSE))
webpage <- getURLContent(url) %>% read_html()
ISSN <- webpage %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text() %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist()
ISSN
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] testit_0.7 dplyr_0.7.4 progress_1.1.2 readxl_1.1.0 stringr_1.3.0 RCurl_1.95-4.10 bitops_1.0-6
[8] httr_1.3.1 rvest_0.3.2 xml2_1.2.0 jsonlite_1.5

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 bindr_0.1.1 magrittr_1.5 R6_2.2.2 rlang_0.2.0 tools_3.5.0
[7] yaml_2.1.19 assertthat_0.2.0 tibble_1.4.2 bindrcpp_0.2.2 curl_3.2 glue_1.2.0
[13] stringi_1.1.7 pillar_1.2.2 compiler_3.5.0 cellranger_1.1.0 prettyunits_1.0.2 pkgconfig_2.0.1
Upvotes: 1
Views: 914
Reputation: 6106
The response that httr receives has a JSON content type, not HTML, so read_html() cannot parse it into a page that your XPath would match. Printing the response shows this:
> conexao
Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
Date: 2018-06-02 03:15
Status: 200
Content-Type: application/json; charset=utf-8
Size: 328 kB
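The content type can also be checked in code before choosing a parser. The sketch below is only an illustration: parse_body() is a hypothetical helper, not part of httr, and the two bodies are made-up strings so that it runs offline. With a live response, the type would come from httr's http_type(conexao). One likely reason the two packages differ, stated here as an assumption, is that httr sends a default Accept header that lists application/json first, while RCurl does not.

```r
library(jsonlite)
library(xml2)

# Hypothetical helper: pick a parser from the media type of a response body.
parse_body <- function(type, text) {
  if (grepl("json", type, fixed = TRUE)) {
    fromJSON(text)
  } else if (grepl("html", type, fixed = TRUE)) {
    read_html(text)
  } else {
    stop("Unhandled content type: ", type)
  }
}

# With a live httr response this would be called as:
# parse_body(httr::http_type(conexao),
#            httr::content(conexao, as = "text", encoding = "UTF-8"))

# Offline demonstration with made-up bodies (not real ResearchGate output).
json_demo <- parse_body("application/json", '{"issn": "0730-0301"}')
html_demo <- parse_body("text/html", "<html><body><p>0730-0301</p></body></html>")

json_issn <- json_demo$issn
html_issn <- xml_text(xml_find_first(html_demo, "//p"))
```

Branching on the type this way keeps the same script usable whether the server answers with JSON or with HTML.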
Use fromJSON() instead to extract the ISSN:
library(jsonlite)
result <- fromJSON(content(conexao, as = "text", encoding = "utf-8"))
result$result$data$journalFullInfo$data$issn
result:
> result$result$data$journalFullInfo$data$issn
[1] "0730-0301"
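A long chain of `$` lookups returns NULL without warning if any level is renamed or missing, which can be hard to tell apart from the original problem. As a minimal sketch, the lookup can be wrapped in a small helper that walks the path one name at a time. Here pluck_path() is a hypothetical helper, and the payload is a made-up string that only mimics the nesting shown above; the real response is much larger.

```r
library(jsonlite)

# Made-up payload that copies only the nesting of the real response.
payload <- '{"result": {"data": {"journalFullInfo": {"data": {"issn": "0730-0301"}}}}}'
result <- fromJSON(payload)

# Hypothetical helper: walk a nested list by name and
# return NULL as soon as a level is missing.
pluck_path <- function(x, path) {
  for (key in path) {
    if (!is.list(x) || is.null(x[[key]])) return(NULL)
    x <- x[[key]]
  }
  x
}

issn <- pluck_path(result, c("result", "data", "journalFullInfo", "data", "issn"))
absent <- pluck_path(result, c("result", "data", "noSuchField", "issn"))

if (is.null(issn)) {
  message("ISSN not found; the structure of the response may have changed")
}
```

Checking the result with is.null() makes a changed response structure visible instead of letting a silent NULL flow into later steps.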
Upvotes: 2