Oliver

Reputation: 284

How do I use rvest to sort text into different columns?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs, which I'm using to scrape affiliation data. However, each time I try, it gives me the 404 error: Error in open.connection(x, "rb") : HTTP error 404

It must be an issue with my use of sapply because when I test it using an individual ID, it works. Here is the code I'm using:

df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")

df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")

df$affiliation <- sapply(df$websites, function(x) try(x %>% read_html %>% html_nodes("#affiliation h3") %>% html_text()))

I actually need to do this for six columns of authors, and there are NA values I'd like to skip, so if anyone knows how to do that as well, I would be enormously grateful (but it's not a big deal if not). Thank you in advance for your help!

EDIT: I have just discovered that the error is in the formula for the websites. Sometimes it should be df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html") and sometimes it should be df$websites <- paste0("https://ideas.repec.org/f/", df$author_reg, ".html")

Does anyone know how to get R to try both and give me the one that works?

Upvotes: 1

Views: 47

Answers (1)

StupidWolf

Reputation: 46968

You can build both links and use try on both of them. I am assuming only one of them will give a valid website. Otherwise we can always edit the code to take in everything that works:

library(rvest)
library(purrr)

df = data.frame(id=1:6)

df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"

df$affiliation <- sapply(df$author_reg, function(x){
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))

  # here we try both links and store the results under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })

  # the good ones will have "character" class, the failed ones "try-error"
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    # keep the first link that worked
    return(attempts[[gdlink[1]]])
  }
  else{
    # neither link could be read
    return("True 404 error")
  }
})

Check the results:

 df
  id author_reg
1  1       paa6
2  2       paa2
3  3       paa1
4  4       paa8
5  5     pve266
6  6     pya500
                                                                                                                   affiliation
1                                                                                   Statistisk SentralbyråGovernment of Norway
2                                                              Department of EconomicsCollege of BusinessUniversity of Wyoming
3 (80%) Institutt for ØkonomiUniversitetet i Bergen, (20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen
4                                                                       Centraal Planbureau (CPB)Government of the Netherlands
5                   Department of FinanceRotterdam School of Management (RSM Erasmus University)Erasmus Universiteit Rotterdam
6                                                                            Business SchoolSwinburne University of Technology
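For the other parts of the question (skipping NA IDs and repeating this over several author columns), here is a rough sketch of one way it could be done. It is not tested against the live site. The helper names fetch_affiliation, get_affiliation and add_affiliations are made up for this example, and it assumes rvest is installed. The fetch function is passed in as an argument so the fallback logic can be checked without network access.

```r
# Hypothetical helper: read the affiliation headings from one page.
# Throws an error (e.g. HTTP 404) if the page cannot be read.
fetch_affiliation <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_text(rvest::html_nodes(page, "#affiliation h3"))
}

# Hypothetical helper: try each base URL in turn for a single author ID.
get_affiliation <- function(id,
                            bases = c("https://ideas.repec.org/e/",
                                      "https://ideas.repec.org/f/"),
                            fetch = fetch_affiliation) {
  if (is.na(id)) return(NA_character_)            # skip missing IDs
  for (base in bases) {
    res <- tryCatch(fetch(paste0(base, id, ".html")),
                    error = function(e) NULL)     # NULL marks a failed attempt
    if (!is.null(res) && length(res) > 0) {
      # several affiliations are collapsed into one string per author
      return(paste(res, collapse = "; "))
    }
  }
  NA_character_                                    # neither URL worked
}

# Hypothetical helper: add one "<column>_affiliation" column per ID column.
add_affiliations <- function(df, cols, fetch = fetch_affiliation) {
  for (col in cols) {
    df[[paste0(col, "_affiliation")]] <- vapply(
      df[[col]], get_affiliation, character(1),
      fetch = fetch, USE.NAMES = FALSE
    )
  }
  df
}
```

With the data frame above, df <- add_affiliations(df, "author_reg") would add an author_reg_affiliation column. For six author columns, pass all six column names in cols. IDs that are NA, or that fail on both URLs, come back as NA instead of an error string, which makes them easy to filter afterwards.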

Upvotes: 1
