Reputation: 2588

Create a new variable defined only for a subset of the data using `dplyr`

Consider this example data:

set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))

And here is an example vectorized function that, unfortunately, throws an error whenever one of its arguments is NA:

myfn <- function(x, y){
    sum(x:y)
}
myfn <- Vectorize(myfn)
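It may help to see what this function returns for fractional inputs, since that explains why var3 equals var1 in the outputs shown in this thread. A minimal sketch, with input values chosen purely for illustration:

```r
# Same definition as in the question, written in one step.
myfn <- Vectorize(function(x, y) sum(x:y))

# `:` steps by 1 starting at x and stops before passing y, so for two
# numbers less than 1 apart the sequence has a single element, x.
0.2:0.7         # 0.2
myfn(0.2, 0.7)  # 0.2
myfn(1, 4)      # 1 + 2 + 3 + 4 = 10

# An NA endpoint makes `:` fail, and the error propagates out of myfn.
failed <- inherits(try(myfn(0.2, NA), silent = TRUE), "try-error")
failed          # TRUE
```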

Now, in the middle of a dplyr chain, I need to create a new variable using myfn. This new variable (var3) is only defined when neither var1 nor var2 is NA.

The most common solution in similar situations is ifelse, something like this:

mydf %>%
    mutate(var3 = ifelse(
        test = is.na(var2), 
        yes = NA, 
        no = myfn(var1, var2)))

But this does not work in my case, because ifelse passes the whole vectors var1 and var2 to myfn, not just the sub-vectors where test is FALSE. Everything then breaks, because myfn fails whenever it receives an NA.
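The behaviour can be reproduced outside dplyr. The sketch below uses small made-up vectors (not the question's data) and wraps the call in tryCatch only so that the script keeps running after the failure:

```r
myfn <- Vectorize(function(x, y) sum(x:y))
var1 <- c(0.2, 0.5, 0.9)
var2 <- c(0.7, NA, 0.1)

# ifelse() evaluates `no` once, on the full vectors, and only afterwards
# picks the elements where `test` is FALSE. The NA in var2 therefore
# reaches myfn and the whole call fails.
res <- tryCatch(
  ifelse(is.na(var2), NA, myfn(var1, var2)),
  error = function(e) paste("error:", conditionMessage(e))
)
res

# Subsetting before the call keeps the NA away from myfn.
ok <- !is.na(var2)
myfn(var1[ok], var2[ok])  # 0.2 0.9
```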

So, what is the clever dplyr solution for this? (I can think of many solutions that do not use dplyr, but I am only interested in a dplyr-friendly one.)

It occurred to me that filter could help, and indeed it works with very readable, dplyr-style code:

mydf %>%
    filter(!is.na(var2)) %>%
    mutate(var3 = myfn(var1, var2))

        var1       var2       var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970

But then I would have to save this in a temporary object, create var3 in the original data filled with NA, and put everything back together in the same data (because, as far as I know, the unfilter that some have suggested does not exist yet).

Just to illustrate the output I want, this code produces it (without using dplyr at all):

mydf$var3 <- NA
index <- !is.na(mydf$var2)
mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
mydf

> mydf
         var1       var2       var3
1  0.56226084 0.62588794 0.56226084
2  0.72649850 0.24145251 0.72649850
3  0.91524985 0.03768974 0.91524985
4  0.02969437 0.51659297 0.02969437
5  0.76750970 0.81845788 0.76750970
6  0.48005398         NA         NA
7  0.08837960         NA         NA
8  0.86294587         NA         NA
9  0.49660306         NA         NA
10 0.85350403         NA         NA

EDIT:

I accepted @krlmlr's solution because it is what I was looking for: clear, easily readable and concise code that integrates effortlessly into a dplyr chain. For my example, the solution looks like this:

mydf %>%
        rowwise %>%
        mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))

However, as @krlmlr pointed out in his answer, operating row by row has a performance cost. It may not matter for small data sets or one-off operations, but for larger data sets, or when the operation is repeated millions of times, it can be considerable. To illustrate, here is a comparison using microbenchmark of three solutions (base, dplyr and data.table) applied to a somewhat larger data set (nothing massive, just 1000 rows instead of the 10 in my original example).

library(data.table)
library(dplyr)

set.seed(1234567)
mydf <- data.frame(var1 = runif(1000), var2 = c(runif(500), rep(NA, 500)))

myfn <- function(x, y){
    sum(x:y)
}
myfn <- Vectorize(myfn)

using_base <- function(){
    mydf$var3 <- NA
    index <- !is.na(mydf$var2)
    mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
}

using_dplyr <- function(){
    mydf <- mydf %>%
        rowwise %>%
        mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
}

using_datatable <- function(){
    setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
}

library(microbenchmark)
mbm <- microbenchmark(
    using_base(), using_dplyr(), using_datatable(), 
    times = 1000)

library(ggplot2)
autoplot(mbm)

In the resulting plot, the dplyr solution using rowwise is considerably slower than its base and data.table rivals.

Upvotes: 8

Answers (5)

jaimedash

Reputation: 2743

This is a great case for adopting the Pythonic style of asking forgiveness rather than permission.

You can solve this with tryCatch and avoid condition-testing altogether:

myfn <- function(x, y){ 
  tryCatch(sum(x:y), error = function(e) NA)
}

Then

myfn <- Vectorize(myfn)
mydf %>%
    mutate(var3 = myfn(var1, var2))

gives the desired result

         var1       var2       var3
1  0.56226084 0.62588794 0.56226084
2  0.72649850 0.24145251 0.72649850
3  0.91524985 0.03768974 0.91524985
4  0.02969437 0.51659297 0.02969437
5  0.76750970 0.81845788 0.76750970
6  0.48005398         NA         NA
7  0.08837960         NA         NA
8  0.86294587         NA         NA
9  0.49660306         NA         NA
10 0.85350403         NA         NA

Addendum

Of course, it's a good idea to return NA only for the right type of error, which is:

> tryCatch(sum(NA:NA), error = function(e) print(str(e)))
List of 2
 $ message: chr "NA/NaN argument"
 $ call   : language NA:NA
 - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
NULL
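One way to act on that, sketched under assumptions: the handler below returns NA only when the message is exactly the one printed above and re-raises anything else. The name myfn_safe is made up for this illustration, and comparing message text assumes R is reporting errors in English, since translated messages would not match.

```r
# Hypothetical name myfn_safe; matching on the message text is an assumption
# that only holds when R reports errors in English.
myfn_safe <- function(x, y) {
  tryCatch(
    sum(x:y),
    error = function(e) {
      if (identical(conditionMessage(e), "NA/NaN argument")) {
        NA_real_   # the expected failure: an NA endpoint
      } else {
        stop(e)    # anything else is re-raised unchanged
      }
    }
  )
}
myfn_safe <- Vectorize(myfn_safe)

res <- myfn_safe(c(0.5, 0.2), c(0.7, NA))
res
```

With English messages, the second element comes back as NA while the first is computed normally.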

Upvotes: 1

talat

Reputation: 70256

Here are two other options you could use in dplyr-pipes:

a) with a temporary variable

mutate(mydf, temp = !(is.na(var1) | is.na(var2)),
       var3 = replace(NA, temp, myfn(var1[temp], var2[temp])),
       temp = NULL)
#         var1       var2       var3
#1  0.56226084 0.62588794 0.56226084
#2  0.72649850 0.24145251 0.72649850
#3  0.91524985 0.03768974 0.91524985
#4  0.02969437 0.51659297 0.02969437
#5  0.76750970 0.81845788 0.76750970
#6  0.48005398         NA         NA
#7  0.08837960         NA         NA
#8  0.86294587         NA         NA
#9  0.49660306         NA         NA
#10 0.85350403         NA         NA
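The replace(NA, temp, ...) call in option a) relies on how R stretches a vector during indexed assignment. A small base R sketch with made-up values (keep and vals are illustrative names) shows the effect in isolation:

```r
keep <- c(TRUE, FALSE, TRUE, FALSE)
vals <- c(10, 30)          # one value per TRUE in keep

# replace(x, list, values) is essentially x[list] <- values.
# Starting from a length-one NA, the longer logical index stretches
# the result to length(keep), and unselected positions stay NA.
out <- replace(NA, keep, vals)
out                        # 10 NA 30 NA
```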

b) with a wrapper function (without changing the original myfn):

myfn2 <- function(x, y) {
  i <- !(is.na(x) | is.na(y))
  res <- rep(NA, length(x))
  res[i] <- myfn(x[i], y[i])
  res
}

mutate(mydf, var3 = myfn2(var1, var2))
#         var1       var2       var3
#1  0.56226084 0.62588794 0.56226084
#2  0.72649850 0.24145251 0.72649850
#3  0.91524985 0.03768974 0.91524985
#4  0.02969437 0.51659297 0.02969437
#5  0.76750970 0.81845788 0.76750970
#6  0.48005398         NA         NA
#7  0.08837960         NA         NA
#8  0.86294587         NA         NA
#9  0.49660306         NA         NA
#10 0.85350403         NA         NA

Upvotes: 2

krlmlr

Reputation: 25444

If your original function isn't vectorized and cannot cope with certain inputs, there's no performance benefit in vectorizing it using Vectorize(). Instead, use dplyr::rowwise() to operate row by row:

iris %>%
  rowwise %>%
  mutate(x = if (Sepal.Length < 5) 1 else NA) %>%
  ungroup

Note that using if here is perfectly safe, as the input has length 1.

Upvotes: 2

eipi10

Reputation: 93761

You could run the function on the complete rows and then bind back the rows with NA (though this is more roundabout than the if... else approach):

mydf %>% filter(complete.cases(.)) %>% 
  mutate(var3 = myfn(var1, var2)) %>%
  bind_rows(mydf %>% filter(!complete.cases(.)))

         var1       var2       var3
        (dbl)      (dbl)      (dbl)
1  0.56226084 0.62588794 0.56226084
2  0.72649850 0.24145251 0.72649850
3  0.91524985 0.03768974 0.91524985
4  0.02969437 0.51659297 0.02969437
5  0.76750970 0.81845788 0.76750970
6  0.48005398         NA         NA
7  0.08837960         NA         NA
8  0.86294587         NA         NA
9  0.49660306         NA         NA
10 0.85350403         NA         NA
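Note that bind_rows appends the incomplete rows at the end, which happens to match the original order here only because the NA rows already come last. If the NA rows were interleaved, one possible variant (a sketch, with a hypothetical helper column row_id and made-up data) would tag the rows first and sort afterwards:

```r
library(dplyr)

myfn <- Vectorize(function(x, y) sum(x:y))
# Illustrative data with the NA rows interleaved rather than at the end.
df <- data.frame(var1 = c(0.2, 0.5, 0.9, 0.4),
                 var2 = c(0.7, NA, 0.1, NA))

# row_id is a hypothetical helper column used only to restore the order.
with_id <- df %>% mutate(row_id = row_number())
done    <- with_id %>% filter(!is.na(var2)) %>% mutate(var3 = myfn(var1, var2))
rest    <- with_id %>% filter(is.na(var2))

res <- bind_rows(done, rest) %>%
  arrange(row_id) %>%
  select(-row_id)
res$var3  # 0.2 NA 0.9 NA
```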

Upvotes: 1

mtoto

Reputation: 24178

You could perhaps consider using data.table, given that dplyr currently does not support in-place mutation, which is what you seem to be looking for.

library(data.table)
setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
#        var1       var2       var3
# 1: 0.56226084 0.62588794 0.56226084
# 2: 0.72649850 0.24145251 0.72649850
# 3: 0.91524985 0.03768974 0.91524985
# 4: 0.02969437 0.51659297 0.02969437
# 5: 0.76750970 0.81845788 0.76750970
# 6: 0.48005398         NA         NA
# 7: 0.08837960         NA         NA
# 8: 0.86294587         NA         NA
# 9: 0.49660306         NA         NA
#10: 0.85350403         NA         NA

Upvotes: 5
