Reputation: 67
I have rewritten the following to clarify the problem and the solution, and have left the original function and the fix as an example below. Thanks again to John Coleman for the help!
The problem: I created a data-scraping function which worked when passed a single URL, but threw this error when passed a vector of URLs:
Error in data.frame(address, recipename, prept, cookt, calories, protein, :
arguments imply differing number of rows: 1, 14, 0
It turned out that some of the URLs I was trying to scrape used a different tag for their instructions section. As a result, the xpathSApply call for instructions returned a result of length 0, which broke the data.frame() call (and hence the rbind).
Figuring out the problem was simply a matter of running each URL through the function until one produced the error, and then checking the HTML structure of that page.
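To see why a zero-length result breaks things, here is a minimal standalone illustration (the values are made up; only the lengths matter):
recipename   <- "Example recipe"   # xpathSApply found one match
instructions <- character(0)       # xpathSApply found no match on this page
data.frame(recipename, instructions)
# Error in data.frame(recipename, instructions) :
#   arguments imply differing number of rows: 1, 0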
Here is the function I originally wrote:
f4fscrape <- function(url) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  colnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')
  colnames(df) <- paste(colnames)
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in length(url)) {
    if (url[i] %in% df$url) {
      next
    } else {
      # parse url as html
      doc2 <- htmlTreeParse(url[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- url[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a data.frame of the url and relevant data
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, list(ingredients), instructions)
      # rename the tricky column
      colnames(result)[10] <- 'ingredients'
      # bind data to existing df
      df <- rbind(df, result)
    }
  }
  # return df
  df
}
And here's the solution: I simply added a conditional fallback, as follows:
instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
if (length(instructions) == 0) {
  instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)
}
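If more page layouts turn up, one way to avoid stacking if blocks is a small helper that tries several XPath expressions in turn; a rough sketch (the helper name is mine, not part of the original code):
# Return the first non-empty xpathSApply result, or NA if nothing matches.
first_match <- function(node, xpaths) {
  for (xp in xpaths) {
    res <- xpathSApply(node, xp, xmlValue)
    if (length(res) > 0) return(res)
  }
  NA_character_
}

instructions <- first_match(top2[[2]],
                            c("//ol[@class='methodOL']",
                              "//ul[@class='b-list m-circle instrs']"))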
Upvotes: 1
Views: 183
Reputation: 51998
I was able to tweak your function so that it works:
library(XML)  # provides htmlTreeParse, xmlRoot, xpathSApply, xmlValue

f4fscrape <- function(urls) {
  # Create an empty dataframe
  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
              'calories', 'protein', 'carbs', 'fat',
              'servings', 'ingredients', 'instructions')
  names(df) <- cnames
  # check for the recipe url in dataframe already,
  # only carry on if not present
  for (i in 1:length(urls)) {
    if (urls[i] %in% df$address) {
      next
    } else {
      # parse url as html
      doc2 <- htmlTreeParse(urls[i], useInternalNodes = TRUE)
      # define the root node
      top2 <- xmlRoot(doc2)
      # scrape relevant data
      address <- urls[i]
      recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
      prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
      cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
      calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
      protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
      carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
      fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
      servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
      ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
      instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
      # create a data.frame of the url and relevant data
      result <- data.frame(address, recipename, prept, cookt,
                           calories, protein, carbs, fat,
                           servings, paste0(ingredients, collapse = ", "),
                           instructions, stringsAsFactors = FALSE)
      # bind data to existing df, matching its column names
      df <- rbind(df, setNames(result, names(df)))
    }
  }
  # return df
  df
}
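For reference, a call might look like this (the URLs below are placeholders; substitute your real recipe pages):
# Placeholder URLs -- replace with the pages you actually want to scrape.
recipe_urls <- c("http://example.com/recipe-1",
                 "http://example.com/recipe-2")
recipes <- f4fscrape(recipe_urls)
View(recipes)   # inspect the scraped data frame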
Changes:
1) url is a built-in function, so I renamed the argument urls; similarly, I renamed colnames to cnames.
2) I changed the way the column names were assigned, using names(df) <- cnames.
3) The loop for (i in length(url)) skips straight to the last index, so only one URL ever gets scraped. I changed it to for (i in 1:length(urls)).
4) The condition if (url[i] %in% df$url) referred to a nonexistent column (url). I changed it to address.
5) The most important change: I concatenated the ingredients into a single string using paste0. With what you had, in the one-URL case each ingredient was put on its own row, and the other columns (by the recycling rule) were simply repeated. Run your current code with a single URL and View() the result: it probably isn't what you intended, so it isn't really true that "it works when one url is passed to it".
6) With all these long strings, it seemed sensible to set stringsAsFactors = FALSE.
7) You need to set the names on the new row when you rbind it onto the data frame, hence setNames(result, names(df)). See this question.
Points 3), 5) and 7) are illustrated in the short sketch below.
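A small standalone sketch of points 3), 5) and 7) (the recipe and ingredient values are made up):
# Point 3: length() returns a single number, so the loop body runs once,
# with i equal to the last index.
urls <- c("url1", "url2", "url3")
for (i in length(urls)) print(i)      # prints 3 only
for (i in 1:length(urls)) print(i)    # prints 1, 2, 3 (seq_along(urls) is safer still)

# Point 5: a multi-element column makes data.frame() recycle the shorter
# columns, giving one row per ingredient instead of one row per recipe.
ingredients <- c("eggs", "flour", "milk")
data.frame(recipename = "Pancakes", ingredients)
#   recipename ingredients
# 1   Pancakes        eggs
# 2   Pancakes       flour
# 3   Pancakes        milk
data.frame(recipename = "Pancakes",
           ingredients = paste0(ingredients, collapse = ", "))
#   recipename       ingredients
# 1   Pancakes eggs, flour, milk

# Point 7: rbind() refuses to bind rows whose names differ, so the new
# row is renamed with setNames() before binding.
old <- data.frame(a = 1, b = 2)
new_row <- data.frame(x = 3, y = 4)
# rbind(old, new_row)                      # error: names do not match previous names
rbind(old, setNames(new_row, names(old)))  # two rows, columns a and b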
When you View() the result of running the tweaked function on your given list of URLs, you see the following (though not so zoomed out, of course):
[screenshot of the resulting data frame]
I don't know enough about the XML library to help you with the speed. Sometimes it runs slowly, sometimes quickly, so it probably has mostly to do with connection speed and is largely beyond your control.
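If you want to see where the time goes, one rough check (a sketch of mine, not from the original answer; the URL is a placeholder) is to time the download and the parse separately:
u <- "http://example.com/recipe-1"   # placeholder URL
system.time(page <- readLines(u))    # time spent on the network
system.time(doc <- htmlTreeParse(paste(page, collapse = "\n"),
                                 useInternalNodes = TRUE, asText = TRUE))  # time spent parsing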
Upvotes: 1