Reputation: 239

issues with webscraping loop in rvest

I have three issues that I cannot find a reasonable solution for through searching. I am working on scraping citation information from journals and am having a little hitch when it comes to compiling the results in a data frame.

This code scrapes well, but the big issue is that it creates one long vector instead of a table. That's the first issue.

The second issue is that if I try to load in the webpages from a csv file, the script will not run. I will get the following error:

Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "factor"

The file is just a CSV with the URLs listed.

The third and final issue is that an article might have more than one email and thus need more than one row. The code ignores this. For example, see the paper at http://journals.sagepub.com/doi/full/10.3102/0013189X17737739

library(rvest)
data<- c("http://journals.sagepub.com/doi/abs/10.3102/0013189X037001060", 
  "http://journals.sagepub.com/doi/abs/10.3102/0013189X037002102",
  "http://journals.sagepub.com/doi/abs/10.3102/0013189X037002104",
  "http://journals.sagepub.com/doi/full/10.3102/0013189X17737739")

scrape <- function(x){
  doc<-read_html(x)
  author <- html_text(html_nodes(doc, '.art_authors'))
  year <- html_text(html_nodes(doc, '.year'))
  journalName <- html_text(html_nodes(doc, '.journalName'))
  art_title <- html_text(html_nodes(doc, '.art_title'))
  volume <- html_text(html_nodes(doc, '.volume'))
  page <- html_text(html_nodes(doc, '.page'))
  email <- html_text(html_nodes(doc, xpath = "//a[@class = 'email']"))

  Author = ifelse(length(author)==0, NA, author)
  Year = ifelse(length(year)==0, NA, year)
  Journal_Name = ifelse(length(journalName)==0, NA, journalName) 
  Art_Title = ifelse(length(art_title)==0, NA, art_title)
  Volume = ifelse(length(volume)==0, NA, volume)
  Page = ifelse(length(page)==0, NA, page)
  Email = ifelse(length(email)==0, NA, email)

  row<-cbind(Author, Year, Journal_Name, Art_Title, Volume, Page, Email)
}

y <- lapply (data, scrape)

View (y)

This is what I run when I try to load the URLs from a CSV:

data<- read.csv ("link_test.csv")
y <- lapply (data$link, scrape)

Any help would be greatly appreciated.

Upvotes: 1

Answers (2)

SeGa

Reputation: 9809

If you add this as the last line, after the lapply() call, the list is bound into a single table and you'll get what you want ;)

y <- do.call(rbind, y)

library(DT)
datatable(y)

For several email addresses, you should change the last-but-one line of the function to:

  Email = ifelse(length(email)==0, NA, 
          ifelse(length(email)==1, email, paste(email, collapse=" ; ")))

But I didn't test that, as I didn't find any webpages with several email addresses.

Upvotes: 2

J. Win.

Reputation: 6771

For the csv, it's hard to answer without seeing the file structure, at least a couple rows. However, the problem might be solved by this:

# bind your list items together as rows
df <- do.call(rbind, y)
# ensure each column is class character rather than factors
df <- as.data.frame(df, stringsAsFactors = FALSE)
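For the read_xml error itself, the message says the URLs arrive as class "factor". A likely cause is that read.csv() converted the link column to a factor, which was its default behaviour before R 4.0.0. Here is a minimal sketch of a workaround. The temporary file is only a stand-in for link_test.csv, and the single `link` column is an assumption based on the `data$link` call in the question:

```r
# Stand-in for link_test.csv: a temporary CSV with a single `link` column
# (the real file's layout is not shown in the question, so this is assumed)
tmp <- tempfile(fileext = ".csv")
writeLines(c("link",
             "http://journals.sagepub.com/doi/abs/10.3102/0013189X037001060",
             "http://journals.sagepub.com/doi/abs/10.3102/0013189X037002102"),
           tmp)

# Read the URLs as character strings instead of factors
links <- read.csv(tmp, stringsAsFactors = FALSE)
class(links$link)

# If the data frame was already read with factors, convert explicitly
urls <- as.character(links$link)

# y <- lapply(urls, scrape)   # scrape() as defined in the question
```

Either option hands read_html() a plain character string, which is a type it has a method for.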

EDIT: updating to answer your edit. In some webpages there are multiple authors, which the webpage seems to present all in one node, as a comma-separated text string. The code you posted does not seem to return any emails for your example webpages. However, if it did return a list or vector of emails, you could collapse them by pasting as shown below:

Email = ifelse(length(email)==0, NA, paste(email, collapse = ", "))
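If you would rather have one row per email than a collapsed string, a rough sketch could look like the one below. The helpers `collapse_or_na` and `make_rows` are hypothetical names, and the input values are made up to stand in for the html_text() results. It relies on data.frame() recycling the length-one fields across all emails:

```r
# Hypothetical helper: NA for an empty result, otherwise one collapsed string
collapse_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = ", ")
}

# Hypothetical builder: one row per email, other fields repeated on each row
make_rows <- function(author, year, email) {
  if (length(email) == 0) email <- NA_character_
  data.frame(Author = collapse_or_na(author),
             Year   = collapse_or_na(year),
             Email  = email,
             stringsAsFactors = FALSE)
}

# Made-up values standing in for what html_text() might return per page
rows <- list(
  make_rows("A. Author", "2008", character(0)),
  make_rows("B. Author, C. Author", "2017",
            c("b@example.org", "c@example.org"))
)

# Stack the per-page data frames into one table
df <- do.call(rbind, rows)
```

Inside scrape(), the same idea would mean returning a data frame built this way instead of the cbind() row, so that do.call(rbind, y) still works on the list from lapply().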

Upvotes: 1
