Jade Reynolds

Reputation: 301

How to finish code to replace NA with median in R

I am very new to R, so please please be gentle.

I am working on the Kaggle Titanic competition, to get me into R and working things out.

I am working my way through engineering a feature and I am a bit stuck with the logic of what to do next.

So, here goes. My goal is to take the Age data and replace every NA with the median age for that person's title. E.g. if the person is a Master, I want to take the median age of all the Masters and replace the NA with that median. Same for Mr. and so on.

I have managed to create myself a data.frame containing title and age as follows:

library(tibble)
data.combined <-
  tibble(
    data.combined.new.title = c(
      "Mr.",
      "Mrs.",
      "Miss",
      "Mrs.",
      "Mr.",
      "Mr.",
      "Mr.",
      "Master",
      "Mrs."
    ),
    data.combined.Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
  )


As you can see in this list, there is a Mr. with an NA next to his age. I want to replace that NA with the median age of all the other Mr. entries in the list.

So far I have the following code, which gets me to the point where I can replace the NAs with the median of the whole data set.

# Creates my data.frame (with the sample tibble above, the columns are
# already named data.combined.new.title and data.combined.Age)
agedata <- as.data.frame(data.combined)

# replace NA with the median of the whole data set
agedata$data.combined.Age[is.na(agedata$data.combined.Age)] <- median(agedata$data.combined.Age, na.rm = TRUE)

What I just don't get is how to extend this code so that each NA is replaced by the median of its own title group (Mr., Master, Mrs., Miss).

Any pointers are gratefully received.

I'm not too interested in whether this is going to help with my prediction for Kaggle at this point, more with how the code should look.

Many Thanks in Advance.

Upvotes: 0

Views: 2210

Answers (4)

ekstroem

Reputation: 6171

Or maybe this tidyverse one-liner (assuming the columns are named title and age):

library(dplyr)
agedata %>% group_by(title) %>% mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
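
For reference, here is a self-contained sketch of the same idea on the sample data from the question. The column names title and age and the result name filled are assumptions for illustration, not the asker's actual names.

```r
library(dplyr)

# sample data from the question, with shortened (assumed) column names
agedata <- data.frame(
  title = c("Mr.", "Mrs.", "Miss", "Mrs.", "Mr.", "Mr.", "Mr.", "Master", "Mrs."),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

filled <- agedata %>%
  group_by(title) %>%                       # one group per title
  mutate(age = ifelse(is.na(age),
                      median(age, na.rm = TRUE),  # median of the known ages in this group
                      age)) %>%
  ungroup()

filled$age[6]  # the Mr. with the missing age now has 35, the median of 22, 35 and 54
```

Note that mutate() returns a new tibble rather than changing agedata in place, so the result has to be assigned to something.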

Upvotes: 3

quant

Reputation: 4482

library(data.table)

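dt <- data.table(
  title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
  age = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# add a helper column holding the median age of each title group
dt[, med_age := median(age, na.rm = TRUE), by = "title"]
# fill the missing ages from it, then drop the helper column
dt[is.na(age), age := med_age]
dt[, med_age := NULL]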
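
A possible single-step variant, as a sketch only: it assumes a data.table version that ships fifelse() (a type-strict counterpart of ifelse()), and the name dt2 is just for illustration.

```r
library(data.table)

dt2 <- data.table(
  title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# within each title group, keep the known ages and fill NAs with the group median
dt2[, age := fifelse(is.na(age), median(age, na.rm = TRUE), age), by = title]
```

This avoids the temporary column. Because fifelse() insists that both branches have the same type, an integer age column would need converting to double first.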

Upvotes: 1

Jeanne Chaudanson

Reputation: 131

This is probably not the most elegant way to do it, but it works:

title <- c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs")
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27)
df = data.frame(title, age)

# get the medians by group
medians <- aggregate(df$age, list(df$title), median, na.rm = TRUE)
# look up the group median for each missing age by title;
# match() also copes with several NAs spread over different groups
missing <- is.na(df$age)
df$age[missing] <- medians$x[match(df$title[missing], medians$Group.1)]
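
The same "compute the medians, then look them up" idea can also be sketched with tapply(), which returns the medians named by title so they can be indexed directly. The names df2 and group_medians are made up for this example.

```r
df2 <- data.frame(
  title = c("Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master", "Mrs"),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# one median per title, named by title
group_medians <- tapply(df2$age, df2$title, median, na.rm = TRUE)

missing <- is.na(df2$age)
# index the medians by the title of each row whose age is missing
df2$age[missing] <- as.vector(group_medians[as.character(df2$title[missing])])
```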

Upvotes: 1

Prasanna Nandakumar

Reputation: 4335

zz <- "group traits
BSPy01-10     NA
BSPy01-10    7.3
BSPy01-10    7.3
BSPy01-11    5.3
BSPy01-11    5.4
BSPy01-11    5.6
BSPy01-11     NA
BSPy01-11     NA
BSPy01-11    4.8
BSPy01-12    8.1
BSPy01-12    6.0
BSPy01-12    6.0
BSPy01-13    6.1"
Data <- read.table(text=zz, header = TRUE)

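# replace the missing values of x with fun() applied to the non-missing ones
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}

library(plyr)  # provides ddply()
ddply(Data, ~ group, transform, traits = impute(traits, median))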
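
Applied to the data in the question, the same helper might be used like this. This is a sketch that swaps ddply() for base R's ave() so no extra package is needed; titanic_sample and its column names are assumptions for illustration.

```r
# replace the missing values of x with fun() applied to the non-missing ones
impute <- function(x, fun) {
  missing <- is.na(x)
  replace(x, missing, fun(x[!missing]))
}

titanic_sample <- data.frame(
  title = c("Mr.", "Mrs.", "Miss", "Mrs.", "Mr.", "Mr.", "Mr.", "Master", "Mrs."),
  age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27)
)

# ave() applies FUN to the ages of each title group and
# returns the results in the original row order
titanic_sample$age <- ave(titanic_sample$age, titanic_sample$title,
                          FUN = function(x) impute(x, median))
```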

Upvotes: 2
