Reputation: 9446
Having a data frame, how do I go about replacing all particular values along all rows and columns. Say for example I want to replace all empty records with NA
's (without typing the positions):
df <- data.frame(list(A=c("", "xyz", "jkl"), B=c(12, "", 100)))
A B
1 12
2 xyz
3 jkl 100
Expected result:
A B
1 NA 12
2 xyz NA
3 jkl 100
Upvotes: 114
Views: 397315
Reputation: 856
It´s also possible to use the gsub function in combination with lapply.
df[] <- lapply(df, function(x) (gsub("", NA, x)))
Upvotes: 0
Reputation: 41533
Another option could be using sapply
. Here is a reproducible example:
df <- data.frame(list(A=c("", "xyz", "jkl"), B=c(12, "", 100)))
df[sapply(df, \(x) x == "")] <- NA
df
#> A B
#> 1 <NA> 12
#> 2 xyz <NA>
#> 3 jkl 100
Created on 2023-01-15 with reprex v2.0.2
Please note: with R 4.1.0 and later you could use \(x)
instead of function(x)
Upvotes: 1
Reputation: 16876
Another option is to use replace_with_na_all()
from the naniar
package, which allows you to replace all the values meeting a condition in the entire dataframe.
library(naniar)
library(dplyr)
df %>%
replace_with_na_all(condition = ~.x == "")
Output
A B
<chr> <chr>
1 NA 12
2 xyz NA
3 jkl 100
The upside to this method is that if you also had some cells that also had spaces included, then we could provide both in the conditions argument. Although it would be better to first just trim the whitespace, then use the function above (i.e., adding mutate(across(everything(), ~ trimws(.x)))
to the pipe).
df <- data.frame(list(A=c("", "xyz", " "), B=c(12, " ", 100)))
df %>%
replace_with_na_all(condition = ~.x %in% c("", " ", " "))
# A B
# <chr> <chr>
#1 NA 12
#2 xyz NA
#3 NA 100
Upvotes: 0
Reputation: 73612
It appears that a solution is missing for multiple values to be replaced and for factors, so I will add one.
Consider a data frame dat
with various classes.
dat
# character integer Date factor POSIX
# 1 4 2022-07-10 B 2022-07-10 20:08:10
# 2 1 2022-07-11 FOO 2022-07-10 21:08:10
# 3 -2 2022-07-12 2022-07-10 22:08:10
# 4 2 2022-07-13 B 2022-07-10 23:08:10
# 5 a 3 2022-07-14 2022-07-11 00:08:10
# 6 c 1 2022-07-15 2022-07-11 01:08:10
# 7 a -1 2022-07-16 FOO 2022-07-11 02:08:10
# 8 a -1 2022-07-17 A 2022-07-11 03:08:10
# 9 4 2022-07-18 FOO 2022-07-11 04:08:10
# 10 c 0 2022-07-19 FOO 2022-07-11 05:08:10
# 11 b -2 2022-07-20 B 2022-07-11 06:08:10
# 12 c -2 2022-07-21 A 2022-07-11 07:08:10
We may put everything we want to convert to NA on a list to_na
,
To_NA <- list('', -1, -2, 'c', 'FOO', as.Date('2022-07-17'), as.POSIXct('2022-07-11 00:08:10'))
and use it in a small function make_na
based on replace
. if
the respective variable is.factor
we may want to droplevels
of values that have just been deleted.
make_na <- \(x, z) {x <- replace(x, x %in% z, NA); if (is.factor(x)) droplevels(x) else x}
We can apply it on a vector,
make_na(dat$character, To_NA)
# [1] NA NA NA NA "a" NA "a" "a" NA NA "b" NA
or loop over the columns using lapply
.
dat[] <- lapply(dat, make_na, To_NA)
dat
# character integer Date factor POSIX
# 1 <NA> 4 2022-07-10 B 2022-07-10 20:08:10
# 2 <NA> 1 2022-07-11 <NA> 2022-07-10 21:08:10
# 3 <NA> NA 2022-07-12 <NA> 2022-07-10 22:08:10
# 4 <NA> 2 2022-07-13 B 2022-07-10 23:08:10
# 5 a 3 2022-07-14 <NA> <NA>
# 6 <NA> 1 2022-07-15 <NA> 2022-07-11 01:08:10
# 7 a NA 2022-07-16 <NA> 2022-07-11 02:08:10
# 8 a NA <NA> A 2022-07-11 03:08:10
# 9 <NA> 4 2022-07-18 <NA> 2022-07-11 04:08:10
# 10 <NA> 0 2022-07-19 <NA> 2022-07-11 05:08:10
# 11 b NA 2022-07-20 B 2022-07-11 06:08:10
# 12 <NA> NA 2022-07-21 A 2022-07-11 07:08:10
Where:
str(dat)
# 'data.frame': 12 obs. of 5 variables:
# $ character: chr NA NA NA NA ...
# $ integer : int 4 1 NA 2 3 1 NA NA 4 0 ...
# $ Date : Date, format: "2022-07-10" "2022-07-11" "2022-07-12" ...
# $ factor : Factor w/ 2 levels "A","B": 2 NA NA 2 NA NA NA 1 NA NA ...
# $ POSIX : POSIXct, format: "2022-07-10 20:08:10" "2022-07-10 21:08:10" "2022-07-10 22:08:10" ...
Data:
dat <- structure(list(character = c("", "", "", "", "a", "c", "a", "a",
"", "c", "b", "c"), integer = c(4L, 1L, -2L, 2L, 3L, 1L, -1L,
-1L, 4L, 0L, -2L, -2L), Date = structure(c(19183, 19184, 19185,
19186, 19187, 19188, 19189, 19190, 19191, 19192, 19193, 19194
), class = "Date"), factor = structure(c(3L, 4L, 1L, 3L, 1L,
1L, 4L, 2L, 4L, 4L, 3L, 2L), levels = c("", "A", "B", "FOO"), class = "factor"),
POSIX = structure(c(1657476490L, 1657480090L, 1657483690L,
1657487290L, 1657490890L, 1657494490L, 1657498090L, 1657501690L,
1657505290L, 1657508890L, 1657512490L, 1657516090L), class = c("POSIXct",
"POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-12L))
Upvotes: 1
Reputation: 10432
Here are a couple dplyr
options:
library(dplyr)
# all columns:
df %>%
mutate_all(~na_if(., ''))
# specific column types:
df %>%
mutate_if(is.factor, ~na_if(., ''))
# specific columns:
df %>%
mutate_at(vars(A, B), ~na_if(., ''))
# or:
df %>%
mutate(A = replace(A, A == '', NA))
# replace can be used if you want something other than NA:
df %>%
mutate(A = as.character(A)) %>%
mutate(A = replace(A, A == '', 'used to be empty'))
Upvotes: 33
Reputation: 1309
If you want to replace multiple values in a data frame, looping through all columns might help.
Say you want to replace ""
and 100
:
na_codes <- c(100, "")
for (i in seq_along(df)) {
df[[i]][df[[i]] %in% na_codes] <- NA
}
Upvotes: 1
Reputation: 7760
We can use data.table to get it quickly. First create df without factors,
df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)), stringsAsFactors=F)
Now you can use
setDT(df)
for (jj in 1:ncol(df)) set(df, i = which(df[[jj]]==""), j = jj, v = NA)
and you can convert it back to a data.frame
setDF(df)
If you only want to use data.frame and keep factors it's more difficult, you need to work with
levels(df$value)[levels(df$value)==""] <- NA
where value is the name of every column. You need to insert it in a loop.
Upvotes: 5
Reputation: 577
Since PikkuKatja and glallen asked for a more general solution and I cannot comment yet, I'll write an answer. You can combine statements as in:
> df[df=="" | df==12] <- NA
> df
A B
1 <NA> <NA>
2 xyz <NA>
3 jkl 100
For factors, zxzak's code already yields factors:
> df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)))
> str(df)
'data.frame': 3 obs. of 2 variables:
$ A: Factor w/ 3 levels "","jkl","xyz": 1 3 2
$ B: Factor w/ 3 levels "","100","12": 3 1 2
If in trouble, I'd suggest to temporarily drop the factors.
df[] <- lapply(df, as.character)
Upvotes: 39
Reputation: 15163
Like this:
> df[df==""]<-NA
> df
A B
1 <NA> 12
2 xyz <NA>
3 jkl 100
Upvotes: 171