Reputation: 67
I have rewritten the following to clarify the problem and the solution, and have left the original function and the fix as an example below. Thanks again to John Coleman for the help!
The problem: I created a data-scraping function which worked when passed a single URL, but threw this error when passed a vector of URLs:
Error in data.frame(address, recipename, prept, cookt, calories, protein, :
arguments imply differing number of rows: 1, 14, 0
It turned out that some of the URLs I was trying to scrape used a different tag for their instructions section. As a result, the xpathSApply call for instructions returned a result of length 0, which broke the data.frame() call (and hence the rbind).
Figuring out the problem was simply a matter of running each URL through the function until one produced the error, and then checking the HTML structure of that page.
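To see why a zero-length result breaks things, here is a minimal standalone illustration (the values are made up; only the lengths matter):
recipename   <- "Example recipe"   # xpathSApply found one match
instructions <- character(0)       # xpathSApply found no match on this page
data.frame(recipename, instructions)
# Error in data.frame(recipename, instructions) :
#   arguments imply differing number of rows: 1, 0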
Here is the function I originally wrote:
f4fscrape <- function(url) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  colnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')
  colnames(df) <- paste(colnames)
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in length(url)) {
    if (url[i] %in% df$url) {
      next
    } else {
      # parse url as html
      doc2 <- htmlTreeParse(url[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- url[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a data.frame of the url and relevant data
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, list(ingredients), instructions)
      # rename the tricky column
      colnames(result)[10] <- 'ingredients'
      # bind data to existing df
      df <- rbind(df, result)
    }
  }
  # return df
  df
}
And here's the solution: I simply added a conditional fallback, as follows:
instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
if (length(instructions) == 0) {
  instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)
}
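If more page layouts turn up, one way to avoid stacking if blocks is a small helper that tries several XPath expressions in turn; a rough sketch (the helper name is mine, not part of the original code):
# Return the first non-empty xpathSApply result, or NA if nothing matches.
first_match <- function(node, xpaths) {
  for (xp in xpaths) {
    res <- xpathSApply(node, xp, xmlValue)
    if (length(res) > 0) return(res)
  }
  NA_character_
}

instructions <- first_match(top2[[2]],
                            c("//ol[@class='methodOL']",
                              "//ul[@class='b-list m-circle instrs']"))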
Upvotes: 1
Views: 183
Reputation: 51998
I was able to tweak your function so that it works:
library(XML)  # provides htmlTreeParse, xmlRoot, xpathSApply, xmlValue

f4fscrape <- function(urls) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
              'calories', 'protein', 'carbs', 'fat',
              'servings', 'ingredients', 'instructions')
  names(df) <- cnames
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in 1:length(urls)) {
    if (urls[i] %in% df$address) {
      next
    } else {
      # parse url as html
      doc2 <- htmlTreeParse(urls[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- urls[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a data.frame of the url and relevant data
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, paste0(ingredients, collapse = ", "),
                           instructions, stringsAsFactors = FALSE)
      # bind data to existing df, matching its column names
      df <- rbind(df, setNames(result, names(df)))
    }
  }
  # return df
  df
}
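For reference, a call might look like this (the URLs below are placeholders; substitute your real recipe pages):
# Placeholder URLs -- replace with the pages you actually want to scrape.
recipe_urls <- c("http://example.com/recipe-1",
                 "http://example.com/recipe-2")
recipes <- f4fscrape(recipe_urls)
View(recipes)   # inspect the scraped data frame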
Changes:
1) url is a built-in function, so I renamed the argument urls; similarly, I renamed colnames to cnames.
2) I changed the way the column names were assigned, using names(df) <- cnames.
3) The loop for (i in length(url)) skips straight to the last index, so only one URL ever gets scraped. I changed it to for (i in 1:length(urls)).
4) The condition if (url[i] %in% df$url) referred to a nonexistent column (url). I changed it to address.
5) The most important change: I concatenated the ingredients into a single string using paste0. With what you had, in the one-URL case each ingredient was put on its own row, and the other columns (by the recycling rule) were simply repeated. Run your current code with a single URL and View() the result: it probably isn't what you intended, so it isn't really true that "it works when one url is passed to it".
6) With all these long strings, it seemed sensible to set stringsAsFactors = FALSE.
7) You need to set the names on the new row when you rbind it onto the data frame, hence setNames(result, names(df)). See this question.
Points 3), 5) and 7) are illustrated in the short sketch below.
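A small standalone sketch of points 3), 5) and 7) (the recipe and ingredient values are made up):
# Point 3: length() returns a single number, so the loop body runs once,
# with i equal to the last index.
urls <- c("url1", "url2", "url3")
for (i in length(urls)) print(i)      # prints 3 only
for (i in 1:length(urls)) print(i)    # prints 1, 2, 3 (seq_along(urls) is safer still)

# Point 5: a multi-element column makes data.frame() recycle the shorter
# columns, giving one row per ingredient instead of one row per recipe.
ingredients <- c("eggs", "flour", "milk")
data.frame(recipename = "Pancakes", ingredients)
#   recipename ingredients
# 1   Pancakes        eggs
# 2   Pancakes       flour
# 3   Pancakes        milk
data.frame(recipename = "Pancakes",
           ingredients = paste0(ingredients, collapse = ", "))
#   recipename       ingredients
# 1   Pancakes eggs, flour, milk

# Point 7: rbind() refuses to bind rows whose names differ, so the new
# row is renamed with setNames() before binding.
old <- data.frame(a = 1, b = 2)
new_row <- data.frame(x = 3, y = 4)
# rbind(old, new_row)                      # error: names do not match previous names
rbind(old, setNames(new_row, names(old)))  # two rows, columns a and b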
When you View() the result of running the tweaked function on your given list of URLs, you see the following (though not so zoomed out, of course):
[screenshot of the resulting data frame]
I don't know enough about the XML library to help you with the speed. Sometimes it runs slowly, sometimes quickly, so it probably has mostly to do with connection speed and is largely beyond your control.
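If you want to see where the time goes, one rough check (a sketch of mine, not from the original answer; the URL is a placeholder) is to time the download and the parse separately:
u <- "http://example.com/recipe-1"   # placeholder URL
system.time(page <- readLines(u))    # time spent on the network
system.time(doc <- htmlTreeParse(paste(page, collapse = "\n"),
                                 useInternalNodes = TRUE, asText = TRUE))  # time spent parsing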
Upvotes: 1