Reputation: 2588
Consider this example data:
set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))
And this example vectorized function which, unfortunately, throws an error whenever one of its arguments is NA:
myfn <- function(x, y){
sum(x:y)
}
myfn <- Vectorize(myfn)
Now, in the middle of a dplyr chain, I need to create a new variable using myfn. This new variable (var3) is only defined when var1 and var2 are not NA.
The most common solution for similar situations is to use ifelse, something like this:
mydf %>%
mutate(var3 = ifelse(
test = is.na(var2),
yes = NA,
no = myfn(var1, var2)))
But this does not work in my case, because ifelse still passes the whole vectors var1 and var2 to myfn, not just the sub-vectors where test is FALSE. It all breaks because myfn fails whenever it receives an NA.
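To make the point above concrete, here is a minimal base R sketch showing that ifelse evaluates its no argument on the full input vector before picking elements. The helper count_and_double and the environment seen are hypothetical names introduced only for this illustration; they are not part of the question.

```r
# Hypothetical helper: records how many elements it was asked to process.
seen <- new.env()
count_and_double <- function(x) {
  seen$n <- length(x)  # remember the size of the input we received
  x * 2
}

x <- c(1, NA, 3)

# ifelse() evaluates `no` once, on the whole vector, and only then
# selects the elements where the test is FALSE.
res <- ifelse(is.na(x), NA, count_and_double(x))

seen$n  # length of the vector that count_and_double() actually saw
res
```

Because the helper sees every element, including the NA, a function that errors on NA input (like myfn) fails before ifelse gets a chance to discard that element.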
So, what is the clever dplyr solution for this? (I can think of many solutions that do not use dplyr, but I am only interested in a dplyr-friendly one.)
It occurred to me that filter could help, and indeed it works, with very readable and dplyr-y code:
mydf %>%
filter(!is.na(var2)) %>%
mutate(var3 = myfn(var1, var2))
var1 var2 var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970
But then I would have to save this in a temporary object, create var3 in the original data filled with NA, and put everything back together in the same data frame (because, as far as I know, the unfilter that some have suggested does not exist, yet).
Just to illustrate the output I want, this code produces it (without using dplyr at all):
mydf$var3 <- NA
index <- !is.na(mydf$var2)
mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
mydf
> mydf
var1 var2 var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970
6 0.48005398 NA NA
7 0.08837960 NA NA
8 0.86294587 NA NA
9 0.49660306 NA NA
10 0.85350403 NA NA
EDIT:
I accepted @krlmlr's solution because it is what I was looking for: clear, easily readable, concise code that integrates effortlessly into a dplyr chain. For my example, this solution looks like this:
mydf %>%
rowwise %>%
mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
However, as @krlmlr pointed out in his answer, operating row by row has a performance cost. It may not be significant for small data sets or one-off operations, but for larger data sets, or when the operation is repeated millions of times, it could be considerable. To illustrate, here is a comparison using microbenchmark of three solutions (base, dplyr and data.table) applied to a somewhat larger data set (nothing massive, just 1000 rows instead of the 10 in my original example).
library(data.table)
library(dplyr)
set.seed(1234567)
mydf <- data.frame(var1 = runif(1000), var2 = c(runif(500), rep(NA, 500)))
myfn <- function(x, y){
sum(x:y)
}
myfn <- Vectorize(myfn)
using_base <- function(){
mydf$var3 <- NA
index <- !is.na(mydf$var2)
mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
}
using_dplyr <- function(){
mydf <- mydf %>%
rowwise %>%
mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
}
using_datatable <- function(){
setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
}
library(microbenchmark)
mbm <- microbenchmark(
using_base(), using_dplyr(), using_datatable(),
times = 1000)
library(ggplot2)
autoplot(mbm)
And as you can see, the dplyr solution using rowwise is considerably slower than its base and data.table rivals.
Upvotes: 8
Views: 1810
Reputation: 2743
This is a great case to adopt the pythonic style of begging forgiveness rather than asking permission.
You can solve this with tryCatch and avoid condition-testing altogether:
myfn <- function(x, y){
tryCatch(sum(x:y), error = function(e) NA)
}
Then
myfn <- Vectorize(myfn)
mydf %>%
mutate(var3 = myfn(var1, var2))
gives the desired result
var1 var2 var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970
6 0.48005398 NA NA
7 0.08837960 NA NA
8 0.86294587 NA NA
9 0.49660306 NA NA
10 0.85350403 NA NA
Addendum
Of course, it's a good idea to return NA only for the right type of error, which here is:
> tryCatch(sum(NA:NA), error = function(e) print(str(e)))
List of 2
$ message: chr "NA/NaN argument"
$ call : language NA:NA
- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
NULL
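As a rough sketch of that idea, the handler below returns NA only when the error message matches the one shown above and re-raises anything else. The names myfn_safe_scalar and myfn_safe are hypothetical, and matching on the message text is an assumption that only holds in an English locale, since R translates error messages.

```r
# Hypothetical scalar wrapper: swallow only the "NA/NaN argument" error.
myfn_safe_scalar <- function(x, y) {
  tryCatch(
    sum(x:y),
    error = function(e) {
      # Assumption: English locale, so the message text is untranslated.
      if (grepl("NA/NaN argument", conditionMessage(e), fixed = TRUE)) {
        NA_real_
      } else {
        stop(e)  # any other error is re-raised unchanged
      }
    }
  )
}

# Vectorized version, usable inside mutate() like the original myfn
myfn_safe <- Vectorize(myfn_safe_scalar)

myfn_safe(c(1, NA), c(3, NA))
```

Errors unrelated to missing values, such as a zero-length argument passed to the scalar version, still surface instead of being silently turned into NA.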
Upvotes: 1
Reputation: 70256
Here are two other options you could use in dplyr-pipes:
a) with a temporary variable
mutate(mydf, temp = !(is.na(var1) | is.na(var2)),
var3 = replace(NA, temp, myfn(var1[temp], var2[temp])),
temp = NULL)
# var1 var2 var3
#1 0.56226084 0.62588794 0.56226084
#2 0.72649850 0.24145251 0.72649850
#3 0.91524985 0.03768974 0.91524985
#4 0.02969437 0.51659297 0.02969437
#5 0.76750970 0.81845788 0.76750970
#6 0.48005398 NA NA
#7 0.08837960 NA NA
#8 0.86294587 NA NA
#9 0.49660306 NA NA
#10 0.85350403 NA NA
b) with a wrapper function (without changing the original myfn):
myfn2 <- function(x, y) {
i <- !(is.na(x) | is.na(y))
res <- rep(NA, length(x))
res[i] <- myfn(x[i], y[i])
res
}
mutate(mydf, var3 = myfn2(var1, var2))
# var1 var2 var3
#1 0.56226084 0.62588794 0.56226084
#2 0.72649850 0.24145251 0.72649850
#3 0.91524985 0.03768974 0.91524985
#4 0.02969437 0.51659297 0.02969437
#5 0.76750970 0.81845788 0.76750970
#6 0.48005398 NA NA
#7 0.08837960 NA NA
#8 0.86294587 NA NA
#9 0.49660306 NA NA
#10 0.85350403 NA NA
Upvotes: 2
Reputation: 25444
If your original function isn't vectorized and cannot cope with certain inputs, there's no performance benefit in vectorizing it using Vectorize(). Instead, use dplyr::rowwise() to operate row by row:
iris %>%
rowwise %>%
mutate(x = if (Sepal.Length < 5) 1 else NA) %>%
ungroup
Note that using if here is perfectly safe, as the input has length 1.
Upvotes: 2
Reputation: 93761
You could run the function on the complete rows and then bind back the rows with NA (though this is more roundabout than the if ... else approach):
mydf %>% filter(complete.cases(.)) %>%
mutate(var3 = myfn(var1, var2)) %>%
bind_rows(mydf %>% filter(!complete.cases(.)))
         var1       var2       var3
        (dbl)      (dbl)      (dbl)
1  0.56226084 0.62588794 0.56226084
2  0.72649850 0.24145251 0.72649850
3  0.91524985 0.03768974 0.91524985
4  0.02969437 0.51659297 0.02969437
5  0.76750970 0.81845788 0.76750970
6  0.48005398         NA         NA
7  0.08837960         NA         NA
8  0.86294587         NA         NA
9  0.49660306         NA         NA
10 0.85350403         NA         NA
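One caveat with the filter-and-bind approach: the incomplete rows end up at the bottom, which matches the original order here only because the NA rows were already last. A possible sketch for keeping the original order, assuming a temporary column (here called row_id, a hypothetical name) is acceptable, with small made-up data in which the NA rows are interleaved:

```r
library(dplyr)

# Hypothetical data with NAs interleaved, unlike the question's example
df <- data.frame(var1 = c(0.2, 0.4, 0.6, 0.8),
                 var2 = c(0.5, NA, 0.9, NA))
myfn <- Vectorize(function(x, y) sum(x:y))

# Tag each row with its original position before splitting
df_id <- df %>% mutate(row_id = row_number())

res <- df_id %>%
  filter(!is.na(var1), !is.na(var2)) %>%       # complete rows only
  mutate(var3 = myfn(var1, var2)) %>%
  bind_rows(df_id %>% filter(is.na(var1) | is.na(var2))) %>%  # var3 filled with NA
  arrange(row_id) %>%                           # restore the original order
  select(-row_id)

res
```

The extra arrange step adds some cost, so this is only worth it when the row order matters.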
Upvotes: 1
Reputation: 24178
You could perhaps consider using data.table, given that dplyr currently does not support in-place mutation, which is what you seem to be looking for.
library(data.table)
setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
# var1 var2 var3
# 1: 0.56226084 0.62588794 0.56226084
# 2: 0.72649850 0.24145251 0.72649850
# 3: 0.91524985 0.03768974 0.91524985
# 4: 0.02969437 0.51659297 0.02969437
# 5: 0.76750970 0.81845788 0.76750970
# 6: 0.48005398 NA NA
# 7: 0.08837960 NA NA
# 8: 0.86294587 NA NA
# 9: 0.49660306 NA NA
#10: 0.85350403 NA NA
Upvotes: 5