Reputation: 179
I have a dataset (main dataset) which looks like this:
id  cleaning_fee  boro           zipcode  price
1   NA            Manhattan      10014    100
2   70            Manhattan      10013    125
3   NA            Brooklyn       11201    97
4   25            Manhattan      10012    110
5   30            Staten Island  10305    60
Grouping by borough and zipcode and averaging cleaning_fee, I get this (using na.rm = TRUE):
borough    zipcode  avgCleaningFee
Brooklyn   11217    88.32000
Brooklyn   11231    89.05085
Brooklyn   11234    42.50000
Manhattan  10003    97.03738
Manhattan  10011    109.97647
What I want to do is impute the NAs in the 'cleaning_fee' variable in my main dataset by either:
(a) imputing the grouped mean (as in the second table above, where I group by borough and zipcode)
or
(b) using KNN regression on variables such as zipcode, boro and price to impute cleaning_fee. (PS: I understand how KNN regression works but I haven't used it, so it would be great if you could explain the code in a line or so.)
Would be great if anyone can help me out with this. Thanks!!
Upvotes: 1
Views: 794
Reputation: 887118
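We can use the first method, replacing each NA with the mean of its (boro, zipcode) group:
library(dplyr)
df1 %>%
  group_by(boro, zipcode) %>%
  # replace missing fees with the mean of the observed fees in the same group
  mutate(cleaning_fee = replace(cleaning_fee,
    is.na(cleaning_fee), mean(cleaning_fee, na.rm = TRUE))) %>%
  ungroup()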
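One thing to watch for: a group with no observed fee at all (for example Manhattan 10014 in the sample rows) gets NaN from mean(..., na.rm = TRUE). Here is a minimal sketch of one possible fallback, assuming the column names from the question; the helper columns fee_zip, fee_boro, fee_all and fill are made up for illustration. It tries the (boro, zipcode) mean first, then the borough mean, then the overall mean.

```r
library(dplyr)

# Sample rows copied from the question
df1 <- data.frame(
  id = 1:5,
  cleaning_fee = c(NA, 70, NA, 25, 30),
  boro = c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Staten Island"),
  zipcode = c(10014, 10013, 11201, 10012, 10305),
  price = c(100, 125, 97, 110, 60)
)

imputed <- df1 %>%
  group_by(boro, zipcode) %>%
  # mean per (boro, zipcode); NaN when the group has no observed fee
  mutate(fee_zip = mean(cleaning_fee, na.rm = TRUE)) %>%
  group_by(boro) %>%
  # coarser fallback: mean per borough
  mutate(fee_boro = mean(cleaning_fee, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(
    # last resort: mean over all observed fees
    fee_all = mean(cleaning_fee, na.rm = TRUE),
    # is.na() is TRUE for NaN as well, so empty groups fall through
    fill = ifelse(!is.na(fee_zip), fee_zip,
                  ifelse(!is.na(fee_boro), fee_boro, fee_all)),
    cleaning_fee = ifelse(is.na(cleaning_fee), fill, cleaning_fee)
  ) %>%
  select(-fee_zip, -fee_boro, -fee_all, -fill)

print(imputed)
```

Whether falling back to a coarser mean is appropriate depends on the data; dropping or flagging those rows is an equally reasonable choice.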
Or with na.aggregate from zoo, which by default fills NAs with the mean of the non-missing values:
library(dplyr)
library(zoo)
df1 %>%
  group_by(boro, zipcode) %>%
  mutate(cleaning_fee = na.aggregate(cleaning_fee)) %>%
  ungroup()
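For option (b), the idea in one line: for each row with a missing fee, find the k rows with an observed fee that are closest in feature space and average their fees. Below is a rough base R sketch of that idea, not a tuned implementation. The function knn_impute_fee is a hypothetical helper written for this post, not from any package. It assumes boro is one-hot encoded and price is standardised, and it leaves zipcode out because treating zip codes as plain numbers gives a questionable distance. k = 2 is chosen only because the sample is tiny.

```r
# Sample rows copied from the question
df1 <- data.frame(
  id = 1:5,
  cleaning_fee = c(NA, 70, NA, 25, 30),
  boro = c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Staten Island"),
  zipcode = c(10014, 10013, 11201, 10012, 10305),
  price = c(100, 125, 97, 110, 60)
)

# Hypothetical helper: KNN regression imputation of cleaning_fee
knn_impute_fee <- function(df, k = 2) {
  # Feature matrix: one dummy column per borough plus standardised price
  x <- cbind(model.matrix(~ boro - 1, data = df),
             price = as.numeric(scale(df$price)))
  known <- !is.na(df$cleaning_fee)
  for (i in which(!known)) {
    # Euclidean distance from row i to every row with an observed fee
    diffs <- sweep(x[known, , drop = FALSE], 2, x[i, ])
    d <- sqrt(rowSums(diffs^2))
    # indices (within the known rows) of the k nearest neighbours
    nearest <- order(d)[seq_len(min(k, sum(known)))]
    # prediction = average fee of those neighbours
    df$cleaning_fee[i] <- mean(df$cleaning_fee[known][nearest])
  }
  df
}

knn_imputed <- knn_impute_fee(df1, k = 2)
print(knn_imputed)
```

On real data you would pick k by cross-validation on the rows with observed fees. Packaged KNN imputation routines also exist (for example in the VIM package); check their documentation for the exact arguments.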
Upvotes: 2