Reputation: 33
I have a data frame with columns that contain missing values, and I would like to replace each missing value with the mean of the values in the cells directly above and below it.
df1<-c(2,2,NA,10, 20, NA,3)
if(df1[i]== NA){
df1[i]= mean(df1[i+1],df1[i-1])
}
However, I am getting this error
Error in if (df1[i] == NA) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In if (df1[i] == NA) { :
the condition has length > 1 and only the first element will be used
Any guidance would be appreciated to solve this issue.
Upvotes: 3
Views: 2395
Reputation: 59
This first checks for NAs in the given column. Where a value is missing, it is replaced with the mean of the column; otherwise the original value is kept.
df$col_name <- ifelse(is.na(df$col_name), ave(df$col_name, FUN = function(x) mean(x, na.rm = TRUE)), df$col_name)
Upvotes: 0
Reputation: 2190
To check for NAs, use is.na(). Make a loop and give mean() a single vector as its argument; otherwise it will only use the first value. This should work if you have no consecutive NAs and the first and last entries are non-NA:
df1<-c(2,2,NA,10, 20, NA,3)
for(i in 2:(length(df1)-1)){
if(is.na(df1[i])){
df1[i]= mean(c(df1[i+1],df1[i-1]))
}
}
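As a small illustration of the two points above, here is a base R sketch (the variable names are only for illustration). It shows why comparing with NA gives if() nothing to act on, and why mean() needs a single vector:

```r
# Any comparison with NA yields NA, so if() receives neither TRUE nor FALSE.
cmp <- NA == NA             # NA
chk <- is.na(NA)            # TRUE

# mean() averages only its first argument; a second positional
# argument is matched to 'trim', not treated as more data.
m_wrong <- mean(2, 10)      # 2
m_right <- mean(c(2, 10))   # 6
```

Wrapping the two neighbours in c() is what makes the loop above average both values.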
Upvotes: 1
Reputation: 21621
You could use na.approx() from the zoo package to replace NA with interpolated values:
library(zoo)
na.approx(df1)
# [1] 2.0 2.0 6.0 10.0 20.0 11.5 3.0
As mentioned by @G.Grothendieck, this will also fill the NAs if there are multiple NAs in a row. Also, if there can be NAs at the ends, then adding the argument na.rm = FALSE will preserve them, or adding rule = 2 will replace them with the first or last non-NA value.
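To make the difference between these options concrete, here is a short sketch, assuming the zoo package is installed. The vector x is made up for illustration and has NAs at both ends:

```r
library(zoo)

# Illustrative vector with a leading and a trailing NA
x <- c(NA, 2, NA, 10, NA)

# Default (na.rm = TRUE): interior NA is interpolated, end NAs are dropped
dropped <- as.numeric(na.approx(x))                 # 2 6 10

# na.rm = FALSE: end NAs are kept, so the length is unchanged
kept <- as.numeric(na.approx(x, na.rm = FALSE))     # NA 2 6 10 NA

# rule = 2: end NAs are replaced by the nearest non-NA value
extended <- as.numeric(na.approx(x, rule = 2))      # 2 2 6 10 10
```

If the result has to go back into a data frame column, na.rm = FALSE or rule = 2 is probably the safer choice, since both keep the original length.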
Upvotes: 2
Reputation: 24945
Using lag and lead from dplyr:
library(dplyr)
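# default = 0 pads the shifted vector with a non-NA numeric value,
# so only positions next to a real NA are selected
df1[is.na(df1)] <- (df1[is.na(lag(df1, default = 0))] +
df1[is.na(lead(df1, default = 0))]) / 2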
This will be much faster than the for-loop version.
Upvotes: 2
Reputation: 206167
If you are sure you don't have any consecutive NA values and the first and last elements are never NA, then you can do
df1<-c(2,2,NA,10, 20, NA,3)
idx<-which(is.na(df1))
df1[idx] <- (df1[idx-1] + df1[idx+1])/2
df1
# [1] 2.0 2.0 6.0 10.0 20.0 11.5 3.0
This should be more efficient than a loop.
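If you are not sure those conditions hold, one option is to check them before filling. The following is only a sketch: fill_na_neighbours is a hypothetical helper name, and stopping with an error is just one possible way to handle a violation.

```r
# Hypothetical helper: fill isolated interior NAs with the mean of
# their two neighbours, and stop if the assumptions do not hold.
fill_na_neighbours <- function(x) {
  idx <- which(is.na(x))
  if (length(idx) == 0) return(x)            # nothing to fill
  if (any(idx == 1 | idx == length(x))) {
    stop("first or last element is NA")
  }
  if (any(diff(idx) == 1)) {
    stop("consecutive NAs found")
  }
  x[idx] <- (x[idx - 1] + x[idx + 1]) / 2    # vectorised neighbour mean
  x
}

fill_na_neighbours(c(2, 2, NA, 10, 20, NA, 3))
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0
```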
Upvotes: 3