Reputation: 773
I am trying to achieve something similar to this question, but with multiple values that must be replaced by NA, and in a large dataset.
df <- data.frame(name = rep(letters[1:3], each = 3), foo = rep(1:9), var1 = rep(1:9), var2 = rep(3:5, each = 3))
which generates this dataframe:
df
name foo var1 var2
1 a 1 1 3
2 a 2 2 3
3 a 3 3 3
4 b 4 4 4
5 b 5 5 4
6 b 6 6 4
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
I would like to replace all occurrences of, say, 3 and 4 by NA, but only in the columns that start with "var".
I know that I can use a combination of [] operators to achieve the result I want:
df[, grep("^var[[:alnum:]]?", colnames(df))][
  df[, grep("^var[[:alnum:]]?", colnames(df))] == 3 |
  df[, grep("^var[[:alnum:]]?", colnames(df))] == 4
] <- NA
df
name foo var1 var2
1 a 1 1 NA
2 a 2 2 NA
3 a 3 NA NA
4 b 4 NA NA
5 b 5 5 NA
6 b 6 6 NA
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
Now my questions are the following: is there a more compact and efficient way of doing this, ideally one that avoids repeating the grep() subsetting once per replacement value joined with the | operator?

Upvotes: 19
Views: 35615
Reputation: 9858
Since dplyr 1.0.0 (early 2020), I believe the dplyr approach would be:
library(dplyr)
df %>% mutate(across(starts_with('var'), ~replace(., . %in% c(3,4), NA)))
name foo var1 var2
1 a 1 1 NA
2 a 2 2 NA
3 a 3 NA NA
4 b 4 NA NA
5 b 5 5 NA
6 b 6 6 NA
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
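If you prefer not to write the replace() lambda, chaining dplyr's na_if() (which replaces one value per call) gives the same result. A minimal sketch using the question's data:

```r
library(dplyr)

df <- data.frame(name = rep(letters[1:3], each = 3), foo = 1:9,
                 var1 = 1:9, var2 = rep(3:5, each = 3))

# na_if() handles a single value at a time, so chain one call per value
out <- df %>%
  mutate(across(starts_with("var"), ~ na_if(na_if(., 3), 4)))
```

For more than a couple of values the replace()/`%in%` form above scales better than chaining.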
An alternative approach uses the naniar package, which replaces values meeting a condition with NA in columns selected by a predicate function (here with str_detect()):
library(dplyr)
library(stringr)
library(naniar)
df %>% replace_with_na_if(str_detect(names(.), '^var'), ~ . %in% c(3, 4))
It would be very nice to see the naniar package updated to work with the current tidyselect syntax via across() and its selection helpers, with something like:
df %>% mutate(across(starts_with('var'), replace_with_na_all(condition = ~ . %in% c(3, 4))))
Upvotes: 7
Reputation: 479
I think dplyr
is very well-suited for this task.
Using replace()
as suggested by @thelatemail, you could do something like this:
library("dplyr")
df <- df %>%
mutate_at(vars(starts_with("var")),
funs(replace(., . %in% c(3, 4), NA)))
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
Upvotes: 2
Reputation: 69
Here is a dplyr solution:
# Define replace function
repl.f <- function(x) ifelse(x %in% c(3, 4), NA, x)
library(dplyr)
cbind(select(df, -starts_with("var")),
      mutate_each(select(df, starts_with("var")), funs(repl.f)))
name foo var1 var2
1 a 1 1 NA
2 a 2 2 NA
3 a 3 NA NA
4 b 4 NA NA
5 b 5 5 NA
6 b 6 6 NA
7 c 7 7 5
8 c 8 8 5
9 c 9 9 5
Upvotes: -3
Reputation: 193507
I haven't timed this option, but I have written a function called makemeNA
that is part of my GitHub-only "SOfun" package.
With that function, the approach would be something like this:
library(SOfun)
Cols <- grep("^var", names(df))
df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4)))
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
The function uses the na.strings
argument in type.convert
to do the conversion to NA
.
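The same mechanism is visible with base R's type.convert() directly; a minimal sketch of the idea (not the package's actual internals):

```r
# Values listed in na.strings become NA when the character
# vector is re-converted to its natural type
x <- type.convert(as.character(c(1, 3, 4, 5)),
                  na.strings = c("3", "4"), as.is = TRUE)
x  # 1 NA NA 5
```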
Install the package with:
library(devtools)
install_github("mrdwab/SOfun")
(or your favorite method of installing packages from GitHub).
Here's some benchmarking. I've decided to make things interesting and replace both numeric and non-numeric values with NA
to see how things compare.
Here's the sample data:
n <- 1000000
set.seed(1)
df <- data.frame(
name1 = sample(letters[1:3], n, TRUE),
name2 = sample(letters[1:3], n, TRUE),
name3 = sample(letters[1:3], n, TRUE),
var1 = sample(9, n, TRUE),
var2 = sample(5, n, TRUE),
var3 = sample(9, n, TRUE))
Here are the functions to test:
fun1 <- function() {
Cols <- names(df)
df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4, "a")))
df
}
fun2 <- function() {
values <- c(3, 4, "a")
col_idx <- names(df)
m1 <- as.matrix(df)
m1[m1 %in% values] <- NA
df[col_idx] <- m1
df
}
fun3 <- function() {
values <- c(3, 4, "a")
col_idx <- names(df)
val_idx <- sapply(df[col_idx], "%in%", table = values)
is.na(df[col_idx]) <- val_idx
df
}
fun4 <- function() {
sel <- names(df)
df[sel] <- lapply(df[sel], function(x)
replace(x, x %in% c(3, 4, "a"), NA))
df
}
I'm timing fun2 and fun3 separately. I'm not crazy about fun2 because it coerces everything to the same type, and I also expect fun3 to be slower.
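The coercion in fun2 is easy to see on a tiny mixed-type frame: as.matrix() must pick a single storage type, so the numeric columns become character.

```r
# A data frame with one numeric and one character column
m <- as.matrix(data.frame(a = 1:2, b = c("x", "y")))
class(m[1, 1])  # "character" -- the numeric column was coerced
```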
system.time(fun2())
# user system elapsed
# 4.45 0.33 4.81
system.time(fun3())
# user system elapsed
# 34.31 0.38 34.74
So now it comes down to me and Thela...
library(microbenchmark)
microbenchmark(fun1(), fun4(), times = 50)
# Unit: seconds
# expr min lq median uq max neval
# fun1() 2.934278 2.982292 3.070784 3.091579 3.617902 50
# fun4() 2.839901 2.964274 2.981248 3.128327 3.930542 50
Dang you Thela!
Upvotes: 4
Reputation: 93813
You can also do this using replace
:
sel <- grepl("^var", names(df))
df[sel] <- lapply(df[sel], function(x) replace(x, x %in% 3:4, NA))
df
# name foo var1 var2
#1 a 1 1 NA
#2 a 2 2 NA
#3 a 3 NA NA
#4 b 4 NA NA
#5 b 5 5 NA
#6 b 6 6 NA
#7 c 7 7 5
#8 c 8 8 5
#9 c 9 9 5
Some quick benchmarking using a million row sample of data suggests this is quicker than the other answers.
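A rough way to reproduce such a comparison yourself (a sketch with made-up column names; timings will vary by machine and are not the exact benchmark referred to above):

```r
n <- 1e6
set.seed(1)
big <- data.frame(name = sample(letters[1:3], n, TRUE),
                  var1 = sample(9, n, TRUE),
                  var2 = sample(5, n, TRUE))

sel <- grepl("^var", names(big))
# Time the lapply/replace approach on a million rows
system.time(
  big[sel] <- lapply(big[sel], function(x) replace(x, x %in% 3:4, NA))
)
```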
Upvotes: 16
Reputation: 886938
You could also do:
col_idx <- grep("^var", names(df))
values <- c(3, 4)
m1 <- as.matrix(df[,col_idx])
m1[m1 %in% values] <- NA
df[col_idx] <- m1
df
# name foo var1 var2
#1 a 1 1 NA
#2 a 2 2 NA
#3 a 3 NA NA
#4 b 4 NA NA
#5 b 5 5 NA
#6 b 6 6 NA
#7 c 7 7 5
#8 c 8 8 5
#9 c 9 9 5
Upvotes: 7
Reputation: 81683
Here's an approach:
# the values that should be replaced by NA
values <- c(3, 4)
# index of columns
col_idx <- grep("^var", names(df))
# [1] 3 4
# index of values (within these columns)
val_idx <- sapply(df[col_idx], "%in%", table = values)
# var1 var2
# [1,] FALSE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
# [4,] TRUE TRUE
# [5,] FALSE TRUE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
# replace with NA
is.na(df[col_idx]) <- val_idx
df
# name foo var1 var2
# 1 a 1 1 NA
# 2 a 2 2 NA
# 3 a 3 NA NA
# 4 b 4 NA NA
# 5 b 5 5 NA
# 6 b 6 6 NA
# 7 c 7 7 5
# 8 c 8 8 5
# 9 c 9 9 5
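The is.na<-() replacement form used here works on plain vectors too, which may make the idiom clearer; a standalone sketch:

```r
x <- c(1, 3, 4, 5)
# Assigning a logical index via is.na<- marks those positions as NA
is.na(x) <- x %in% c(3, 4)
x  # 1 NA NA 5
```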
Upvotes: 4