Reputation: 5415
I have the following function, which scrapes the content of a Wikipedia page given its URL (the exact content is irrelevant for this question):
getPageContent <- function(url) {
  library(rvest)
  library(magrittr)
  pc <- html(url) %>%
    html_node("#mw-content-text") %>%
    # strip tags
    html_text() %>%
    # concatenate vector of texts into one string
    paste(collapse = "")
  pc
}
Calling the function on a single URL works:
getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")
[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)
However, if I want to use the function inside dplyr's mutate() to get the content of multiple pages, I get an error:
example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
)
library(dplyr)
example <- mutate(example, content = getPageContent(url))
Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
the condition has length > 1 and only the first element will be used
Looking at the error, I assume the problem lies with getPageContent's inability to handle a vector of URLs, but I have no idea how to solve it.
++++
EDIT: The two proposed solutions, 1) use rowwise() and 2) use sapply(), both work well. Simulating with 10 random Wikipedia articles, the second approach (timed below as unlist(lapply())) is about 25% quicker in elapsed time:
> system.time(
+ example <- example %>%
+ rowwise() %>%
+ mutate(content = getPageContent(url))
+ )
   user  system elapsed
   0.39    0.14    1.21
>
>
> system.time(
+ example$content <- unlist(lapply(example$url, getPageContent))
+ )
   user  system elapsed
   0.49    0.11    0.90
Upvotes: 3
Views: 3054
Reputation: 5335
Instead of trying to pass a vector of strings to a function that's looking for a single string, why not use lapply() on a vector of URLs:
urls <- c("https://en.wikipedia.org/wiki/Balance_(game_design)",
          "https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Tifama_chera",
          "https://en.wikipedia.org/wiki/Difference_theory")
And then:
content <- lapply(urls, getPageContent)
...which gives you back a list. Or, if your URLs are already in a data frame and you want to add the contents as a new column in it, use sapply(), which returns a vector instead of a list:
example$contents <- sapply(example$url, getPageContent)
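As a variation on the sapply() call above, here is a minimal sketch using vapply(), which stops with an error if any call does not return exactly one string. So that it runs offline, it uses fakePageContent(), a hypothetical stand-in for getPageContent() that is not part of the original code; like the real function, it insists on a single URL.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL
# and returns one string, without touching the network.
fakePageContent <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", basename(url))
}

example <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Difference_theory"),
  stringsAsFactors = FALSE
)

# Call the function once per URL. character(1) declares that each call
# must return a single string; USE.NAMES = FALSE drops the URL names
# that sapply() would otherwise attach to the result.
example$content <- vapply(example$url, fakePageContent, character(1),
                          USE.NAMES = FALSE)
```

Swapping fakePageContent for the real getPageContent should work the same way, assuming every page yields a single string.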
Upvotes: 2
Reputation: 887038
You can use rowwise(), which makes mutate() call the function once per row instead of once with the whole column:
res <- example %>%
  rowwise() %>%
  mutate(content = getPageContent(url))
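To illustrate the difference without hitting the network, here is a small sketch that assumes a hypothetical stand-in, fakePageContent(), which rejects anything but a single URL just as the real function does. A plain mutate() hands it the whole column and fails, while the rowwise() version hands it one value at a time. The trailing ungroup() is my addition, so that later verbs are not applied row by row.

```r
library(dplyr)

# Hypothetical stand-in for getPageContent(): one URL in, one string out.
fakePageContent <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", basename(url))
}

example <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Difference_theory"),
  stringsAsFactors = FALSE
)

# Without rowwise(), mutate() passes both URLs in a single call,
# so the length check inside the function raises an error.
plain_attempt <- try(mutate(example, content = fakePageContent(url)),
                     silent = TRUE)
plain_failed <- inherits(plain_attempt, "try-error")

# With rowwise(), the function is evaluated separately for each row.
res <- example %>%
  rowwise() %>%
  mutate(content = fakePageContent(url)) %>%
  ungroup()
```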
Upvotes: 10