watchtower

Reputation: 4298

Replace NA with values from another table based on groupings (not a one-by-one lookup table)

My objective is to replace values in one table with values from another lookup table. There is one catch: this isn't a one-to-one lookup table as discussed in Replace na's with value from another df; the lookup is done based on multiple column groupings. As a result, if multiple entries match those groupings in the lookup table, all of them need to be populated in the original table.

I was able to do this task but I need help with two things:

a) My code is really messy. Every time I have to do something similar, I spend an enormous amount of time figuring out what I did before I can re-use it. I'd appreciate anything cleaner and simpler.

b) It's very slow. I have multiple ifelse statements, and when I run this on the actual data with 36M records it takes a long time.

Here's my source with dummy data:

dput(DFile)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2", 
"G3", "G3", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5"), 
    Country_SV = c("United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "UK", 
    "UK"), Product_BU = c("Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", "Laptop"
    ), Prob_model3 = c(0, 79647405.9878251, 282615405.328728, 
    NA, NA, 363419594.065383, 0, 72870592.8458704, 260045174.088548, 
    369512727.253779, 0, 79906001.2878251, 285128278.558728, 
    405490639.873629, 234, NA), DoS.FY = c(2014, 2013, 2012, 
    NA, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
    2015, 2016, NA), Insured = c("Covered", "Covered", "Covered", 
    NA, NA, "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Covered", NA)), .Names = c("Region_SL", "Country_SV", 
"Product_BU", "Prob_model3", "DoS.FY", "Insured"), row.names = c(NA, 
16L), class = "data.frame")

Here's my grouped look-up table:

dput(Master_Joined)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G3", 
"G4", "G5", "G5", "G5"), Country_SV = c("United States", "United States", 
"United States", "United States", "United States", "United States", 
"United States", "UK", "UK", "UK"), Product_BU = c("Laptop", 
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", 
"Laptop", "Laptop"), DoS.FY = c(2014, 2013, 2012, 2015, 2015, 
2015, 2015, 2016, 2017, 2017), Insured = c("Covered", "Covered", 
"Covered", "Uncovered", "Not Covered", "Not Covered", "Not Covered", 
"Covered", "Uncovered", "Covered")), .Names = c("Region_SL", 
"Country_SV", "Product_BU", "DoS.FY", "Insured"), row.names = c(NA, 
10L), class = "data.frame")

This is "grouped" in a sense that all entries are unique.

Finally, here's my code:

library(dplyr)

#Which fields are missing?
Missing <- DFile[is.na(DFile$Prob_model3), ]

Column_name<-colnames(DFile)[4]
colnames(DFile)[4]<-"temp_prob"

#Replace Prob_model3
DFile<-DFile %>%
  group_by(Region_SL, Country_SV, Product_BU) %>%
  dplyr::mutate(Average_Value = mean(temp_prob,na.rm = TRUE)) %>%
  rowwise() %>%
  dplyr::mutate(Col_name1 = ifelse(is.na(temp_prob),Average_Value,temp_prob)) %>%
  dplyr::select(Region_SL:Product_BU,DoS.FY,Insured,Col_name1)

colnames(DFile)[6]<-Column_name

Missing$DoS.FY <- NULL

Missing_FYear <- Missing %>%
  inner_join(Master_Joined, by = c("Region_SL", "Country_SV", "Product_BU")) %>%
  group_by(Region_SL, Country_SV, Product_BU, DoS.FY, Insured.y) %>%
  dplyr::distinct() %>%
  left_join(Missing)

Missing_FYear$Prob_model3 <- NULL

DFile <- DFile %>%
  left_join(Missing_FYear, by = c("Region_SL", "Country_SV", "Product_BU", "Insured")) %>%
  dplyr::rowwise() %>%
  mutate(DoS.FY = ifelse(is.na(DoS.FY.y) | is.na(DoS.FY.x),
                         sum(DoS.FY.y, DoS.FY.x, na.rm = TRUE), DoS.FY.x),
         Insured_Combined = ifelse(is.na(Insured), Insured.y, Insured)) %>%
  dplyr::select(Region_SL:Product_BU, Prob_model3, DoS.FY, Insured_Combined)

colnames(DFile)[6] <- "Insured"

#Check again
Missing <- DFile[is.na(DFile$Prob_model3), ]

if (nrow(Missing) > 0) {
  #you have NaNs, replace them with 0
  DFile[is.nan(DFile$Prob_model3), "Prob_model3"] <- 0
}
Missing <- DFile[is.na(DFile$Prob_model3), ]

Expected output: DFile as produced by running the code above.

I'd sincerely appreciate your help. I have been struggling with this problem for about a week now.

Upvotes: 2

Views: 1273

Answers (2)

Jonathan von Schroeder

Reputation: 1703

Another way to think about this is to merge only the rows that have missing values in DoS.FY or Insured with the master data:

#Replace missing probabilities by grouped average
DFile_new <- DFile %>%
  group_by(Region_SL, Country_SV, Product_BU) %>%
  mutate(Prob_model3 = coalesce(Prob_model3, mean(Prob_model3, na.rm = TRUE))) %>%
  ungroup()
#This leads to one NaN because for
# 16        G5            UK     Laptop          NA     NA        <NA>
#there are no other rows in the same group
DFile_new$Prob_model3[is.nan(DFile_new$Prob_model3)] <- 0

#Split dataset into two parts
#1) The part that has no NA's in DoS.FY and Insured
DFile_new1 <- filter(DFile_new,!is.na(DoS.FY) & !is.na(Insured))
#2) The part that has NA's in either DoS.FY or Insured
DFile_new2 <- filter(DFile_new,is.na(DoS.FY) | is.na(Insured))

#merge DFile_new2 and Master_Joined
DFile_new2 <- merge(DFile_new2,Master_Joined,by=c("Region_SL","Country_SV","Product_BU")) %>%
  mutate(DoS.FY.x = coalesce(DoS.FY.x,DoS.FY.y), Insured.x = coalesce(Insured.x,Insured.y)) %>%
  select(-Insured.y,-DoS.FY.y) %>% rename(Insured=Insured.x, DoS.FY = DoS.FY.x)

#Put all rows in frame
my_out_new <- rbind(DFile_new1,DFile_new2)
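For readers unfamiliar with `coalesce`: it returns the first non-NA value element-wise across its arguments, which is what makes the `DoS.FY.x`/`DoS.FY.y` merge above work. A minimal standalone illustration (toy vectors, not the OP's data):

```r
library(dplyr)

x <- c(1, NA, 3)
y <- c(9, 2, 9)

# first non-NA value, element-wise
coalesce(x, y)
#> [1] 1 2 3
```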

This yields the same result as OP's code (albeit in a different order):

> compare <- function(df1,df2) {
+   idx1 <- c()
+   idx2 <- c()
+   for(i in 1:nrow(df1)) {
+     found <- FALSE
+     for(j in 1:nrow(df2)) {
+       if(!(j %in% idx2)) {
+         idx = as.logical(df1[i,] != df2[j,])
+         d <- suppressWarnings(abs(as.numeric(df1[i,idx])-as.numeric(df2[j,idx]))) < 1e-5
+         if(!(any(is.na(d))) & all(d)) {
+           idx1 <- c(idx1,i)
+           idx2 <- c(idx2,j)
+           break;
+         }
+       }
+     }
+   }
+   rbind(idx1,idx2)
+ }
> compare(my_out,my_out_new)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
idx1    1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17
idx2    1    2    3   14   15   16   17    4   18     5     6     7     8     9    10    11    12
     [,18] [,19] [,20]
idx1    18    19    20
idx2    13    19    20

(where my_out is the resulting DFile of the OP's code)

Upvotes: 0

Sotos

Reputation: 51592

An idea is to find the Region_SL groups which have NA. Once we do, we use plyr's rbind.fill to rbind the matching lookup rows to a new_df. We then filter out any rows with NA (except in the last column, col 6). Finally, we group by Region_SL and use replace to fill the remaining NA values of Prob_model3 with the per-group means.

library(dplyr)
ind <- unique(as.integer(which(is.na(DFile), arr.ind = TRUE)[, 1]))
new_df <- plyr::rbind.fill(Master_Joined[Master_Joined$Region_SL %in% DFile$Region_SL[ind], ],
                           DFile)

new_df %>% 
  arrange(Region_SL, Prob_model3) %>% 
  filter(complete.cases(.[-6])) %>% 
  group_by(Region_SL) %>% 
  mutate(Prob_model3 = replace(Prob_model3, is.na(Prob_model3), mean(Prob_model3, na.rm = TRUE))) %>%
  ungroup()

# A tibble: 21 × 6
#   Region_SL    Country_SV Product_BU DoS.FY     Insured Prob_model3
#       <chr>         <chr>      <chr>  <dbl>       <chr>       <dbl>
#1         G1 United States     Laptop   2014     Covered           0
#2         G1 United States     Laptop   2013     Covered    79647406
#3         G1 United States     Laptop   2012     Covered   282615405
#4         G1 United States     Laptop   2014     Covered   120754270
#5         G1 United States     Laptop   2013     Covered   120754270
#6         G1 United States     Laptop   2012     Covered   120754270
#7         G1 United States     Laptop   2015   Uncovered   120754270
#8         G2 United States     Laptop   2015 Not Covered   363419594
#9         G2 United States     Laptop   2015 Not Covered   363419594
#10        G3 United States     Laptop   2015 Not Covered           0
# ... with 11 more rows
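The `complete.cases(.[-6])` step keeps only rows with no NA outside the sixth column. A small illustration of that pattern (toy data frame, not the OP's):

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# complete.cases over every column except the last:
# the NA in b is ignored, the NA in a drops row 2
df[complete.cases(df[-2]), ]
#> rows 1 and 3 are kept
```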

Upvotes: 2
