Reputation: 3020
I want to replace NAs present in a column of a DATA TABLE with the mean of the same column. I am doing the following. But it is not working.
ww <- data.table(iris)
ww <- ww[1:5 , ]
ww[1,1] <- NA
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: NA 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
ww[is.na(Sepal.Length) , Sepal.Length:= mean(Sepal.Length, na.rm = T)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: NaN 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
Why am I getting NaN in place of NA when it should have been the mean of the rest of the values (4.9, 4.7, 4.6, 5.0)?
What is the alternate of acheiving this in case something is wrong with this syntax?
I want to the syntax for the data table.
Upvotes: 16
Views: 3241
Reputation: 7818
just one line will do the trick with fcoalesce
:
ww[, Sepal.Length := fcoalesce(Sepal.Length, mean(Sepal.Length, na.rm = TRUE))]
ww
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 4.8 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
Upvotes: 1
Reputation: 16697
While the zoo
answer is pretty nice it requires new dependency.
Using just data.table
you could do the following.
library(data.table)
# prepare data
ww = data.table(iris[1:5,])
ww[1, Sepal.Length := NA]
# solution
ww[, Sepal.Length.mean := mean(Sepal.Length, na.rm = TRUE) # calculate mean
][is.na(Sepal.Length), Sepal.Length := Sepal.Length.mean # replace NA with mean
][, Sepal.Length.mean := NULL # remove mean col
][] # just prints
While it may looks biggish comparing to zoo's, it is performance efficient as all steps are made using update by reference :=
.
It can also be easily tuned to replace NA with mean by group, just using by
argument in data.table.
Upvotes: 13
Reputation: 23608
In base R:
ww$Sepal.Length[is.na(ww$Sepal.Length)] <- mean(ww$Sepal.Length, na.rm = T)
Upvotes: 6
Reputation: 24945
tidyr
has a built in function, replace_na
you can use for this:
library(tidyr)
ww %>% replace_na(list(Sepal.Length = mean(.$Sepal.Length, na.rm = TRUE)))
Upvotes: 3
Reputation: 3274
Your attempt subsetted the table first, selecting
> ww[is.na(Sepal.Length)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:
NA 3.5 1.4 0.2 setosa
so any further operations can only 'see' these rows - i.e. Sepal.Length
can only see that one NA
.
The data.table
solution you want is below - it looks at the whole table and replaces the NA
s with the means using an ifelse
.
ww[, Sepal.Length := ifelse(is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE), Sepal.Length)]
Upvotes: 7
Reputation: 269421
na.aggregate
in the zoo package replaces NAs with the mean of the non-NAs in the same column:
library(zoo)
ww[, Sepal.Length := na.aggregate(Sepal.Length)]
Upvotes: 21
Reputation: 2226
It is not taking the mean of the entire Sepal.Length column; only the 1 column that you have chosen.
Rather use:
ww[is.na(Sepal.Length) , Sepal.Length:= mean(ww$Sepal.Length, na.rm=TRUE)]
Upvotes: 4