I'm trying to impute missing values in a specific column of a data frame.
My intention is to replace each missing value with the mean of that column within the group defined by another column.
I've saved the aggregated results using aggregate:
# Replace LotFrontage missing values by Neighborhood mean
lot_frontage_by_neighborhood = aggregate(LotFrontage ~ Neighborhood, combined, mean)
And now I want to implement something like this:
for key, group in combined.groupby("Neighborhood"):
    idx = (combined["Neighborhood"] == key) & (combined["LotFrontage"].isnull())
    combined.loc[idx, "LotFrontage"] = group["LotFrontage"].mean()
This is, of course, Python code.
I'm not sure how to achieve this in R. Can someone please help?
For example:
Neighborhood LotFrontage
A 20
A 30
B 20
B 50
A <NA>
The NA record should be replaced with 25 (the average LotFrontage of all non-missing records in Neighborhood A).
Thanks
Upvotes: 0
Views: 82
Is this the idea you are looking for? You may need the which() function to determine which rows have NA values.
set.seed(1)
Neighborhood = sample(letters[1:4], 10, TRUE)
LotFrontage = rnorm(10,0,1)
LotFrontage[sample(10, 2)] = NA
# This data frame has 10 rows and 2 columns. The LotFrontage column has 2 missing values.
df = data.frame(Neighborhood = Neighborhood, LotFrontage = LotFrontage)
# Set each missing LotFrontage value to the mean of the non-missing LotFrontage values from the rows with the same Neighborhood
missing <- which(is.na(df$LotFrontage))
x <- as.character(df$Neighborhood[missing])
f <- function(n) mean(df$LotFrontage[df$Neighborhood == n], na.rm = TRUE)
# vapply returns a plain numeric vector (lapply would return a list)
df$LotFrontage[missing] <- vapply(x, f, numeric(1))
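If you would rather avoid the helper function, here is a minimal alternative sketch using base R's ave(), which returns the per-group mean aligned with every row. It is not the approach above, just another way to get the same result. It rebuilds the five-row example from the question under the name combined, and the names group_mean and missing are only illustrative.

```r
# Example data from the question: one missing LotFrontage in Neighborhood A
combined <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "A"),
  LotFrontage  = c(20, 30, 20, 50, NA)
)

# Per-row group mean of LotFrontage, ignoring NAs within each Neighborhood
group_mean <- ave(combined$LotFrontage, combined$Neighborhood,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Overwrite only the missing entries with their group's mean
missing <- is.na(combined$LotFrontage)
combined$LotFrontage[missing] <- group_mean[missing]

print(combined)
# The last row's LotFrontage is now 25, the mean of 20 and 30
```

Note that if a Neighborhood has no non-missing LotFrontage values at all, its mean is NaN, so those rows would need a separate fallback (for example, the overall mean).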
Upvotes: 1