stackoverflowuser2010
stackoverflowuser2010

Reputation: 40869

Replace NA with mean based on row subset matching another column?

I have data where each row contains a person's gender and weight (in lbs):

genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)

df <- data.frame(gender=genders, weight=weights)
df
#   gender weight
# 1 FEMALE    110
# 2 FEMALE    120
# 3 FEMALE    112
# 4 FEMALE     NA
# 5 FEMALE     NA
# 6   MALE    190
# 7   MALE    202
# 8   MALE    195
# 9   MALE     NA

For every row that has an NA in the weight column, I would like to replace / impute the NA with the weight mean, but the mean should be calculated using only the rows that match the same gender value as the row with the NA.

Specifically, rows 4 and 5 have a gender of FEMALE and weight of NA. I want to replace the NA with the mean weight computed over the subset of rows that match the gender of FEMALE. In this case, the mean would be (110+120+112)/3=114.0 from the other rows 1, 2, and 3.

Likewise, I want to replace the NA in row 9 with the mean of the weights for MALE gender.

I tried the following command, but it replaced the NA with the mean weight over all users across both genders, which is not what I want.

df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T)
df
#   gender   weight
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 154.8333
# 5 FEMALE 154.8333
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 154.8333

I searched other questions, but they are not quite the same problem as mine:

"Replace NA with mean matching the same ID"

"How to replace NA with mean by subset in R (impute with plyr?)"

"How to replace NA values in a table for selected columns? data.frame, data.table"

Upvotes: 4

Views: 2991

Answers (4)

akrun
akrun

Reputation: 886938

This can be easily done using na.aggregate from zoo. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'gender', we apply the na.aggregate to 'weight' to replace the NA elements with the mean value. By default, the na.aggregate return the mean, but we can also change the FUN argument to get median or sum etc.

library(data.table)
library(zoo)
setDT(df)[, weight := na.aggregate(weight) , by = gender]

Or with ave from base R

with(df, ave(weight, gender, FUN = na.aggregate))
#[1] 110.0000 120.0000 112.0000 114.0000 114.0000 190.0000 202.0000 195.0000 195.6667

Upvotes: 1

Rich Scriven
Rich Scriven

Reputation: 99321

You could use ave() with replace() (or standard manual replacement).

df$weight <- with(df, ave(weight, gender,
    FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

which gives

  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667

Upvotes: 7

Warner
Warner

Reputation: 1363

Using base R this appears to be what you're looking for:

df$weight[df$gender=="FEMALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="FEMALE"], na.rm=TRUE)
df$weight[df$gender=="MALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="MALE"], na.rm=TRUE)

> df
  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667

Upvotes: 2

akuiper
akuiper

Reputation: 214927

You can group your data frame by gender and then calculate the average of weight and replace the NA with ifelse statement, in dplyr, it could be:

library(dplyr)
df %>% 
      group_by(gender) %>% 
      mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = T), weight))

# Source: local data frame [9 x 2]
# Groups: gender [2]

#  gender   weight
#  <fctr>    <dbl>
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 114.0000
# 5 FEMALE 114.0000
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 195.6667

Upvotes: 5

Related Questions