watchtower

Reputation: 4298

Replace NA with values from another table based on groupings (not a one-by-one lookup table)

My objective is to replace values in one table with values from another lookup table. There is one catch: this isn't a one-to-one lookup table as discussed in Replace na's with value from another df; the lookup is done based on multiple column groupings. As a result, if multiple entries match those groupings in the lookup table, all of them need to be populated in the original table.

I was able to do this task but I need help with two things:

a) My code is really messy. Every time I have to do something similar, I spend an enormous amount of time figuring out what I did before I can re-use it. I'd appreciate anything cleaner and simpler.

b) It's very slow. I have multiple ifelse statements, and when I run this on the actual data with 36M records it takes a long time.

Here's my source with dummy data:

dput(DFile)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2", 
"G3", "G3", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5"), 
    Country_SV = c("United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "UK", 
    "UK"), Product_BU = c("Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", "Laptop"
    ), Prob_model3 = c(0, 79647405.9878251, 282615405.328728, 
    NA, NA, 363419594.065383, 0, 72870592.8458704, 260045174.088548, 
    369512727.253779, 0, 79906001.2878251, 285128278.558728, 
    405490639.873629, 234, NA), DoS.FY = c(2014, 2013, 2012, 
    NA, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
    2015, 2016, NA), Insured = c("Covered", "Covered", "Covered", 
    NA, NA, "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Covered", NA)), .Names = c("Region_SL", "Country_SV", 
"Product_BU", "Prob_model3", "DoS.FY", "Insured"), row.names = c(NA, 
16L), class = "data.frame")

Here's my grouped look-up table:

dput(Master_Joined)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G3", 
"G4", "G5", "G5", "G5"), Country_SV = c("United States", "United States", 
"United States", "United States", "United States", "United States", 
"United States", "UK", "UK", "UK"), Product_BU = c("Laptop", 
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", 
"Laptop", "Laptop"), DoS.FY = c(2014, 2013, 2012, 2015, 2015, 
2015, 2015, 2016, 2017, 2017), Insured = c("Covered", "Covered", 
"Covered", "Uncovered", "Not Covered", "Not Covered", "Not Covered", 
"Covered", "Uncovered", "Covered")), .Names = c("Region_SL", 
"Country_SV", "Product_BU", "DoS.FY", "Insured"), row.names = c(NA, 
10L), class = "data.frame")

This is "grouped" in a sense that all entries are unique.

Finally, here's my code:

library(dplyr)

#Which fields are missing?
Missing <- DFile[is.na(DFile$Prob_model3), ]

Column_name<-colnames(DFile)[4]
colnames(DFile)[4]<-"temp_prob"

#Replace Prob_model3
DFile<-DFile %>%
  group_by(Region_SL, Country_SV, Product_BU) %>%
  dplyr::mutate(Average_Value = mean(temp_prob,na.rm = TRUE)) %>%
  rowwise() %>%
  dplyr::mutate(Col_name1 = ifelse(is.na(temp_prob),Average_Value,temp_prob)) %>%
  dplyr::select(Region_SL:Product_BU,DoS.FY,Insured,Col_name1)

colnames(DFile)[6]<-Column_name

Missing$DoS.FY <- NULL

Missing_FYear <- Missing %>%
  inner_join(Master_Joined, by = c("Region_SL", "Country_SV", "Product_BU")) %>%
  group_by(Region_SL, Country_SV, Product_BU, DoS.FY, Insured.y) %>%
  dplyr::distinct() %>%
  left_join(Missing)

Missing_FYear$Prob_model3 <- NULL

DFile <- DFile %>%
  left_join(Missing_FYear, by = c("Region_SL", "Country_SV", "Product_BU", "Insured")) %>%
  dplyr::rowwise() %>%
  mutate(DoS.FY = ifelse(is.na(DoS.FY.y) | is.na(DoS.FY.x),
                         sum(DoS.FY.y, DoS.FY.x, na.rm = TRUE), DoS.FY.x),
         Insured_Combined = ifelse(is.na(Insured), Insured.y, Insured)) %>%
  dplyr::select(Region_SL:Product_BU, Prob_model3, DoS.FY, Insured_Combined)

colnames(DFile)[6] <- "Insured"

#Check again
Missing <- DFile[is.na(DFile$Prob_model3), ]

if (nrow(Missing) > 0) {
  #you have NaNs, replace them with 0
  DFile[is.nan(DFile$Prob_model3), "Prob_model3"] <- 0
}
Missing <- DFile[is.na(DFile$Prob_model3), ]

Expected output: DFile as produced by running the code above.

I'd sincerely appreciate your help. I have been struggling with this problem for about a week now.

Upvotes: 2

Views: 1273

Answers (2)

Jonathan von Schroeder

Reputation: 1703

Another way to think about this is to merge only the rows that have missing values in DoS.FY or Insured with the master data:

#Replace missing probabilities by grouped average
DFile_new <- DFile %>%
  group_by(Region_SL, Country_SV, Product_BU) %>%
  mutate(Prob_model3 = coalesce(Prob_model3, mean(Prob_model3, na.rm = TRUE))) %>%
  ungroup()
#This leads to one NaN because for
# 16        G5            UK     Laptop          NA     NA        <NA>
#there are no other rows in the same group
DFile_new$Prob_model3[is.nan(DFile_new$Prob_model3)] <- 0

#Split dataset into two parts
#1) The part that has no NA's in DoS.FY and Insured
DFile_new1 <- filter(DFile_new,!is.na(DoS.FY) & !is.na(Insured))
#2) The part that has NA's in either DoS.FY or Insured
DFile_new2 <- filter(DFile_new,is.na(DoS.FY) | is.na(Insured))

#merge DFile_new2 and Master_Joined
DFile_new2 <- merge(DFile_new2,Master_Joined,by=c("Region_SL","Country_SV","Product_BU")) %>%
  mutate(DoS.FY.x = coalesce(DoS.FY.x,DoS.FY.y), Insured.x = coalesce(Insured.x,Insured.y)) %>%
  select(-Insured.y,-DoS.FY.y) %>% rename(Insured=Insured.x, DoS.FY = DoS.FY.x)

#Put all rows in frame
my_out_new <- rbind(DFile_new1,DFile_new2)
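For readers unfamiliar with `coalesce`: it returns the first non-NA value element-wise across its arguments, which is what makes the `DoS.FY.x`/`DoS.FY.y` merge above work. A minimal standalone illustration (toy vectors, not the OP's data):

```r
library(dplyr)

x <- c(1, NA, 3)
y <- c(9, 2, 9)

# first non-NA value, element-wise
coalesce(x, y)
#> [1] 1 2 3
```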

This yields the same result as OP's code (albeit in a different order):

> compare <- function(df1,df2) {
+   idx1 <- c()
+   idx2 <- c()
+   for(i in 1:nrow(df1)) {
+     found <- FALSE
+     for(j in 1:nrow(df2)) {
+       if(!(j %in% idx2)) {
+         idx = as.logical(df1[i,] != df2[j,])
+         d <- suppressWarnings(abs(as.numeric(df1[i,idx])-as.numeric(df2[j,idx]))) < 1e-5
+         if(!(any(is.na(d))) & all(d)) {
+           idx1 <- c(idx1,i)
+           idx2 <- c(idx2,j)
+           break;
+         }
+       }
+     }
+   }
+   rbind(idx1,idx2)
+ }
> compare(my_out,my_out_new)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
idx1    1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17
idx2    1    2    3   14   15   16   17    4   18     5     6     7     8     9    10    11    12
     [,18] [,19] [,20]
idx1    18    19    20
idx2    13    19    20

(where my_out is the resulting DFile of the OP's code)

Upvotes: 0

Sotos

Reputation: 51592

An idea is to find the Region_SL groups which have NA. Once we do, we use plyr's rbind.fill to rbind the matching lookup rows to a new_df. We then filter out any rows with NA (except in the last column, col 6). Finally, we group by Region_SL and use replace to fill the remaining NA values of Prob_model3 with the per-group means.

library(dplyr)
ind <- unique(as.integer(which(is.na(DFile), arr.ind = TRUE)[, 1]))
new_df <- plyr::rbind.fill(Master_Joined[Master_Joined$Region_SL %in% DFile$Region_SL[ind], ],
                           DFile)

new_df %>% 
  arrange(Region_SL, Prob_model3) %>% 
  filter(complete.cases(.[-6])) %>% 
  group_by(Region_SL) %>% 
  mutate(Prob_model3 = replace(Prob_model3, is.na(Prob_model3), mean(Prob_model3, na.rm = TRUE))) %>%
  ungroup()

# A tibble: 21 × 6
#   Region_SL    Country_SV Product_BU DoS.FY     Insured Prob_model3
#       <chr>         <chr>      <chr>  <dbl>       <chr>       <dbl>
#1         G1 United States     Laptop   2014     Covered           0
#2         G1 United States     Laptop   2013     Covered    79647406
#3         G1 United States     Laptop   2012     Covered   282615405
#4         G1 United States     Laptop   2014     Covered   120754270
#5         G1 United States     Laptop   2013     Covered   120754270
#6         G1 United States     Laptop   2012     Covered   120754270
#7         G1 United States     Laptop   2015   Uncovered   120754270
#8         G2 United States     Laptop   2015 Not Covered   363419594
#9         G2 United States     Laptop   2015 Not Covered   363419594
#10        G3 United States     Laptop   2015 Not Covered           0
# ... with 11 more rows
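The `complete.cases(.[-6])` step keeps only rows with no NA outside the sixth column. A small illustration of that pattern (toy data frame, not the OP's):

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# complete.cases over every column except the last:
# the NA in b is ignored, the NA in a drops row 2
df[complete.cases(df[-2]), ]
#> rows 1 and 3 are kept
```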

Upvotes: 2
