PaulBeales
PaulBeales

Reputation: 505

Impute variables within a data.frame group by factor column

I have a data.frame contain numeric columns, these columns have factor levels that I want to impute missing values by...let me explain.

part   id   value
a      1     23.4
a      2     23.8
a      3     45.6
a      4     34.7
a      5     Na
b      1     45.2
b      2     34.6
b      3     Na
b      4     30.9
b      5     28.1

Id like to impute the NA values with the mean of the part. So for part a, I'd like to impute the id 5 missing value with the mean of ids 1-4 in part a, and same for part b, impute missing id3 with the mean of ids in part b etc.

I need to do this across many columns (imagine having many more value columns). So perhaps an apply with a function etc.

Upvotes: 3

Views: 2581

Answers (1)

akrun
akrun

Reputation: 887118

Using na.strings argument in read.table/read.csv we can convert the missing values to real NA and thereby reading the 'value' columns as 'numeric'. With dplyr, we can change replace the NAs in multiple value columns with mean of that column.

library(dplyr)
df1 %>%
    group_by(part) %>%
    mutate_each(funs(replace(., which(is.na(.)), mean(., na.rm=TRUE))), 
       starts_with('value'))

Or a similar option with data.table

library(data.table)
nm1 <- grep('value', names(df1))
setDT(df1)[, (nm1) := lapply(.SD,  function(x) replace(x,
     which(is.na(x)), mean(x, na.rm=TRUE))), by = part,.SDcols=nm1]

data

df1 <- read.table(text="part   id   value
a      1     23.4
a      2     23.8
a      3     45.6
a      4     34.7
a      5     Na
b      1     45.2
b      2     34.6
b      3     Na
b      4     30.9
b      5     28.1", header=TRUE, na.strings="Na", stringsAsFactors=FALSE)

Upvotes: 2

Related Questions