Reputation: 33

replacing a missing value in R with average value

I have a data frame whose columns contain missing values, and I would like to replace each missing value with the mean of the values in the cells directly above and below it.

 df1<-c(2,2,NA,10, 20, NA,3)
 if(df1[i]== NA){
 df1[i]= mean(df1[i+1],df1[i-1])
}

However, I am getting this error:

  Error in if (df1[i] == NA) { : missing value where TRUE/FALSE needed
  In addition: Warning message:
  In if (df1[i] == NA) { :
  the condition has length > 1 and only the first element will be used

Any guidance would be appreciated to solve this issue.

Upvotes: 3

Answers (5)

Parthiban M

Reputation: 59

This first checks for NAs in the given column. Where a value is missing, it is replaced with the mean of the whole column; otherwise the original value is kept.

# FUN must be upper case, and na.rm is set with `=`, not `==`
df$col_name <- ifelse(is.na(df$col_name), ave(df$col_name, FUN = function(x) mean(x, na.rm = TRUE)), df$col_name)

Upvotes: 0

mts

Reputation: 2190

To check for NAs, use is.na(). Loop over the elements and pass mean() a single vector as its argument; otherwise it only averages the first value. This should work if you have no consecutive NAs and the first and last entries are non-NA:

df1<-c(2,2,NA,10, 20, NA,3)
for(i in 2:(length(df1)-1)){
  if(is.na(df1[i])){
     df1[i]= mean(c(df1[i+1],df1[i-1]))
  }
}
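
To illustrate the two points above, here is a minimal base R sketch (the values are made up for the example). It shows why a comparison with `==` cannot detect NA, and what happens when mean() receives two separate arguments instead of one vector:

```r
# Comparing anything with NA yields NA, so if() has no TRUE/FALSE to test
NA == NA             # NA
is.na(NA)            # TRUE

# The second positional argument of mean() is `trim`, not another value
mean(10, 2)          # 10
mean(c(10, 2))       # 6
```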

Upvotes: 1

Steven Beaupré

Reputation: 21621

You could use na.approx() from the zoo package to replace NA with interpolated values:

library(zoo)
na.approx(df1)
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0

As mentioned by @G.Grothendieck, this also fills the NAs when there are several in a row. If there can be NAs at the ends, adding the argument na.rm = FALSE will preserve them, while adding rule = 2 will replace leading NAs with the first non-NA value and trailing NAs with the last non-NA value.
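
A short sketch of those options, assuming the zoo package is installed. The vectors `x_run` and `x_ends` are made up for illustration and are not from the question:

```r
library(zoo)

x_run  <- c(1, NA, NA, 4)      # two NAs in a row
x_ends <- c(NA, 1, NA, 3, NA)  # NAs at both ends

na.approx(x_run)                  # 1 2 3 4
na.approx(x_ends)                 # 1 2 3 (leading/trailing NAs are dropped)
na.approx(x_ends, na.rm = FALSE)  # NA 1 2 3 NA
na.approx(x_ends, rule = 2)       # 1 1 2 3 3
```

Note that with the defaults the result can be shorter than the input, which matters when assigning it back into a data frame column.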

Upvotes: 2

jeremycg

Reputation: 24945

Using lag and lead from dplyr:

library(dplyr)

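# default must be a non-NA value of the same type as df1 (recent dplyr
# versions reject ""), so the padded end is not mistaken for a missing value
df1[is.na(df1)] <- (df1[is.na(lag(df1, default = 0))] +
                    df1[is.na(lead(df1, default = 0))]) / 2

Because it is vectorised, this should be faster than the for loop version. Like the loop, it assumes there are no consecutive NAs and that the first and last elements are not NA.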

Upvotes: 2

MrFlick

Reputation: 206167

If you are sure you don't have any consecutive NA values and the first and last elements are never NA, then you can do

df1<-c(2,2,NA,10, 20, NA,3)
idx<-which(is.na(df1))
df1[idx] <- (df1[idx-1] + df1[idx+1])/2
df1
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0

This should be more efficient than a loop.
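
If those assumptions cannot be guaranteed, one option is to fill only the NAs that have two non-missing neighbours and leave the rest untouched. The helper below, `fill_na_neighbors`, is a hypothetical name for a minimal sketch of that idea, not part of any package:

```r
# Hypothetical helper: fill isolated, interior NAs with the mean of both neighbours
fill_na_neighbors <- function(x) {
  n <- length(x)
  idx <- which(is.na(x))
  idx <- idx[idx > 1 & idx < n]                      # skip first and last positions
  idx <- idx[!is.na(x[idx - 1]) & !is.na(x[idx + 1])] # both neighbours must be known
  x[idx] <- (x[idx - 1] + x[idx + 1]) / 2
  x
}

fill_na_neighbors(c(2, 2, NA, 10, 20, NA, 3))
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0
fill_na_neighbors(c(NA, 1, NA, NA, 4, NA, 6))
# [1] NA  1 NA NA  4  5  6
```

For a data frame, the same function could be applied column by column, for example with `df[] <- lapply(df, fill_na_neighbors)`, assuming all columns are numeric.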

Upvotes: 3
