Anurag Kaushik

Reputation: 179

Conditional imputation of one variable using dplyr

I have a dataset (main dataset) which looks like this:

id cleaning_fee boro           zipcode           price
1  NA           Manhattan       10014            100
2  70           Manhattan       10013            125
3  NA           Brooklyn        11201            97
4  25           Manhattan       10012            110
5  30           Staten Island   10305            60

Grouping by borough and zipcode, I get this (using na.rm = TRUE):

borough   zipcode avgCleaningFee    
Brooklyn    11217   88.32000        
Brooklyn    11231   89.05085        
Brooklyn    11234   42.50000        
Manhattan   10003   97.03738        
Manhattan   10011   109.97647

What I want to do is impute the NAs in the 'cleaning_fee' variable in my main dataset by either:

(a) imputing the grouped mean (as shown in the second table above, where I group on two conditions)

or

(b) using KNN regression on variables such as zipcode, boro and price to impute the cleaning fee. (PS: I understand how KNN regression works but I haven't used it, so it would be great if you could explain the code in a line or so.)

Would be great if anyone can help me out with this. Thanks!!

Upvotes: 1

Views: 794

Answers (1)

akrun

Reputation: 887118

We can use the first method, replacing each NA with the mean of its borough/zipcode group:

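library(dplyr)
df1 %>%
   # compute the mean separately for each borough/zipcode combination
   group_by(boro, zipcode) %>%
   # replace only the missing fees with that group's mean
   mutate(cleaning_fee = replace(cleaning_fee, 
            is.na(cleaning_fee), mean(cleaning_fee, na.rm = TRUE))) %>%
   ungroup()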
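
One caveat with the grouped mean: if every cleaning_fee in a borough/zipcode group is NA, mean(..., na.rm = TRUE) returns NaN, so those rows are not really filled. Below is a minimal sketch of one way to handle that, assuming the column names from the question's sample data. The fallback to the borough-level mean and the helper columns fee_zip and fee_boro are assumptions added for illustration, not part of the original approach.

```r
library(dplyr)

# Sample rows copied from the question
df1 <- data.frame(
  id = 1:5,
  cleaning_fee = c(NA, 70, NA, 25, 30),
  boro = c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Staten Island"),
  zipcode = c(10014, 10013, 11201, 10012, 10305),
  price = c(100, 125, 97, 110, 60)
)

imputed <- df1 %>%
  # mean within each borough/zipcode group (NaN if the group has no observed fee)
  group_by(boro, zipcode) %>%
  mutate(fee_zip = mean(cleaning_fee, na.rm = TRUE)) %>%
  # mean within each borough, used only as a fallback
  group_by(boro) %>%
  mutate(fee_boro = mean(cleaning_fee, na.rm = TRUE)) %>%
  ungroup() %>%
  # keep observed fees, then try the zipcode mean, then the borough mean
  # (is.na() is TRUE for both NA and NaN)
  mutate(cleaning_fee = case_when(
    !is.na(cleaning_fee) ~ cleaning_fee,
    !is.na(fee_zip)      ~ fee_zip,
    TRUE                 ~ fee_boro
  )) %>%
  select(-fee_zip, -fee_boro)
```

In these five sample rows, zipcode 10014 has no observed fee, so row 1 falls back to the Manhattan mean. Brooklyn has no observed fee at all, so row 3 stays missing; on the full dataset a further fallback (for example the overall mean) may or may not be needed.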

Or with na.aggregate from zoo:

library(zoo)
df1 %>%
  group_by(boro, zipcode) %>%
  # na.aggregate() fills NAs with the group mean by default (FUN = mean)
  mutate(cleaning_fee = na.aggregate(cleaning_fee)) %>%
  ungroup()
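
For option (b), one possibility is kNN() from the VIM package. In one line: for each row with a missing cleaning_fee it finds the k most similar rows on the chosen columns (using Gower distance, so factors and numbers can be mixed) and fills in the median of their fees. This is only a sketch, assuming VIM is installed; treating zipcode as a factor and using k = 2 (chosen because the sample has only three rows with an observed fee) are assumptions to revisit on the real data.

```r
library(VIM)

# Sample rows copied from the question
df1 <- data.frame(
  id = 1:5,
  cleaning_fee = c(NA, 70, NA, 25, 30),
  boro = factor(c("Manhattan", "Manhattan", "Brooklyn", "Manhattan", "Staten Island")),
  zipcode = factor(c(10014, 10013, 11201, 10012, 10305)),  # categorical, not numeric
  price = c(100, 125, 97, 110, 60)
)

df_knn <- kNN(
  df1,
  variable = "cleaning_fee",                 # column to impute
  dist_var = c("boro", "zipcode", "price"),  # columns used to find neighbours
  k = 2,                                     # number of neighbours (tune on real data)
  imp_var = FALSE                            # do not add the cleaning_fee_imp flag column
)
```

Passing numFun = mean instead of the default median would make the result closer in spirit to the grouped-mean approach.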

Upvotes: 2
