Timm S.

Reputation: 5415

R - pass vector to custom function in dplyr::mutate

I have the following function, which scrapes the content of a Wikipedia page given its URL (the exact content is irrelevant for this question):

getPageContent <- function(url) {

        library(rvest)
        library(magrittr)

        # note: html() is deprecated in newer versions of rvest; read_html() is its replacement
        pc <- html(url) %>% 
                html_node("#mw-content-text") %>% 
                # strip tags
                html_text() %>%
                # concatenate vector of texts into one string
                paste(collapse = "")

        pc
}

Calling the function on a single URL works:

getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")

[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)

However, if I use the function inside dplyr's mutate() to get the content of multiple pages, I get an error:

example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
                      )

library(dplyr)
example <- mutate(example, content = getPageContent(url))

Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
  the condition has length > 1 and only the first element will be used

Looking at the error, I assume the problem lies with getPageContent's inability to handle a vector of URLs, but I have no idea how to solve it.

++++

EDIT: Both proposed solutions, 1) rowwise() and 2) sapply() (timed below as unlist(lapply())), work well. In a test with 10 random Wikipedia articles, the second approach was about 25% quicker in elapsed time:

> system.time(
+         example <- example %>% 
+                 rowwise() %>% 
+                 mutate(content = getPageContent(url)) 
+ )
   user  system elapsed 
   0.39    0.14    1.21 
> 
> 
> system.time(
+         example$content <- unlist(lapply(example$url, getPageContent))
+ )
   user  system elapsed 
   0.49    0.11    0.90 

Upvotes: 3

Views: 3054

Answers (2)

ulfelder

Reputation: 5335

Instead of trying to pass a vector of strings to a function that's looking for a single string, why not use lapply() on a vector of URLs:

urls = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
         "https://en.wikipedia.org/wiki/Koncerthuset",
         "https://en.wikipedia.org/wiki/Tifama_chera",
         "https://en.wikipedia.org/wiki/Difference_theory")

And then:

content <- lapply(urls, getPageContent)

...which gives you back a list. Or, if your URLs are already in a data frame and you want to add the contents as a new column, use sapply(), which simplifies the result to a character vector instead of a list:

example$contents <- sapply(example$url, getPageContent)
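If you want a guarantee that every page yields exactly one string, vapply() is a stricter variant of the sapply() call above. The following is only a sketch: fakePageContent() is a hypothetical stand-in for getPageContent() that, like the original, accepts a single URL but does not touch the network.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL, no network access.
fakePageContent <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", url)
}

example <- data.frame(url = c("https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE)

# vapply() calls the function once per URL and fails early unless each
# call returns a single character string; USE.NAMES = FALSE drops the
# URL names that sapply() would attach to the result.
example$contents <- vapply(example$url, fakePageContent,
                           FUN.VALUE = character(1), USE.NAMES = FALSE)
```

Swapping fakePageContent for the real getPageContent should work the same way, assuming the scraper always returns one string per page.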

Upvotes: 2

akrun

Reputation: 887038

You can use rowwise(), so that mutate() calls getPageContent() once per row with a single URL:

 res <- example %>% 
             rowwise() %>% 
             mutate(content=getPageContent(url))
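Another option, not part of the answer above, is to wrap the function with base R's Vectorize() so that it accepts a whole vector of URLs. A minimal sketch, again using a hypothetical stand-in (fakePageContent) instead of the real scraper so that it runs offline:

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL, no network access.
fakePageContent <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", url)
}

# Vectorize() returns a wrapper that calls the function once per element
# (via mapply) and simplifies the results to a character vector.
fakePageContentV <- Vectorize(fakePageContent, USE.NAMES = FALSE)

urls <- c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Difference_theory")

res <- fakePageContentV(urls)
```

Because the wrapper returns one value per input element, it should be usable directly in mutate(content = ...) without rowwise(); that usage is an assumption and is not timed here.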

Upvotes: 10
