pipts

Reputation: 87

How to fill in NAs (mean/median) in a data frame depending on the values of another data frame?

I have a data frame, df, and I ran the Shapiro test on the Values column for every unique ID, storing the results in a data table (dt_table). I also wrote a few lines of code that replace the NAs with the mean for each unique ID. What I really want is to check dt_table: if the column "accept_H1" is TRUE for an ID, replace that ID's NAs with the median; if it is FALSE, replace them with the mean.

dput(df)
structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1", 
"F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", 
"F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4", 
"F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9, 
1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868, 
NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA, 
53.6)), class = "data.frame", row.names = c(NA, -32L))

This is the Shapiro test and the data table where I put the results.

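library(data.table)

# Run the Shapiro-Wilk test per ID and collect one row of results per group
dt_list <- by(df, df$ID, function(sub) {
  results <- shapiro.test(sub$Values)  # shapiro.test drops NAs itself

  data.table(
    ID = sub$ID[1],
    Wilk = results$statistic,
    p_value = results$p.value,
    accept_H1 = results$p.value <= 0.05  # TRUE when normality is rejected
  )
})

# Stack the per-ID results into a single table
dt_table <- data.table::rbindlist(dt_list)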

These are a few lines of code that replace all the NAs with the mean of each ID. I want to check dt_table so that, for example, if the column "accept_H1" is TRUE for F1, the NAs for F1 are replaced with the median, but if it is FALSE, they are replaced with the mean. Can anybody help with that?

library(dplyr)

# Current approach: always fill NAs with the group mean
df %>% 
  group_by(ID) %>% 
  mutate(Values = ifelse(is.na(Values), mean(Values, na.rm = TRUE), Values))

Upvotes: 1

Views: 72

Answers (1)

akrun

Reputation: 887048

We may need to join or merge with 'dt_table' and then use an if/else condition. It may be easier to use na.aggregate from zoo, which replaces the NAs with the result of a function (the mean by default).

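library(data.table)
library(zoo)
# Update join: copy accept_H1 from dt_table onto the matching rows of df
setDT(df)[dt_table[, .(ID, accept_H1)], accept_H1 := i.accept_H1, on = .(ID)]
# Per ID: fill NAs with the median when accept_H1 is TRUE, otherwise with the mean
df[, Values := if (first(accept_H1)) na.aggregate(Values, FUN = median) else
           na.aggregate(Values), by = ID][, accept_H1 := NULL][]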

-output

> df
    ID      Values
 1: F1    9.600000
 2: F1    9.883333
 3: F1   10.200000
 4: F1    9.800000
 5: F1    9.900000
 6: F1    9.900000
 7: F1    9.900000
 8: F2    1.200000
 9: F2    1.200000
10: F2    1.800000
11: F2    1.500000
12: F2    1.500000
13: F2    1.600000
14: F2    1.400000
15: F2    1.457143
16: F3 3266.000000
17: F3 3256.000000
18: F3 7044.000000
19: F3 6868.000000
20: F3 3410.000000
21: F3 3405.000000
22: F3 3410.000000
23: F3 3410.000000
24: F3 5567.000000
25: F4   59.400000
26: F4   56.000000
27: F4   52.800000
28: F4   52.400000
29: F4   55.500000
30: F4   54.950000
31: F4   54.950000
32: F4   53.600000
    ID      Values

Alternatively, this can be done without merging: match the current grouping value with the corresponding 'ID' in 'dt_table' and use that index to extract the 'accept_H1' value for the if/else condition.

library(dplyr)
library(zoo)
# For each group, look up its accept_H1 flag in dt_table and pick median or mean
df %>% 
    group_by(ID) %>%
    mutate(Values = if (dt_table$accept_H1[match(cur_group()$ID, 
         dt_table$ID)]) na.aggregate(Values, FUN = median) else
           na.aggregate(Values)) %>% 
    ungroup()
# A tibble: 32 x 2
   ID    Values
   <chr>  <dbl>
 1 F1      9.6 
 2 F1      9.88
 3 F1     10.2 
 4 F1      9.8 
 5 F1      9.9 
 6 F1      9.9 
 7 F1      9.9 
 8 F2      1.2 
 9 F2      1.2 
10 F2      1.8 
# … with 22 more rows
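
If zoo is not available, the same rule can probably be written in base R. The following is only a sketch under stated assumptions: `df_toy` and `flags` are made-up stand-ins for 'df' and 'dt_table', and `impute_by_flag` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical toy data standing in for 'df'
df_toy <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Values = c(1, 2, NA, 9, 10, NA, 20, 30)
)

# Hypothetical lookup standing in for 'dt_table' (TRUE means use the median)
flags <- data.frame(ID = c("A", "B"), accept_H1 = c(TRUE, FALSE))

# Illustrative helper: fill NAs with the median or the mean of the non-missing values
impute_by_flag <- function(values, use_median) {
  fill <- if (use_median) median(values, na.rm = TRUE) else mean(values, na.rm = TRUE)
  values[is.na(values)] <- fill
  values
}

# Look up each row's flag by matching its ID against the lookup table
use_median <- flags$accept_H1[match(df_toy$ID, flags$ID)]

# Apply the helper to the row indices of each ID in turn
groups <- split(seq_len(nrow(df_toy)), df_toy$ID)
for (idx in groups) {
  df_toy$Values[idx] <- impute_by_flag(df_toy$Values[idx], use_median[idx[1]])
}

df_toy
```

Here the NA in group "A" becomes the median of 1, 2 and 9, and the NA in group "B" becomes the mean of 10, 20 and 30. The `match()` lookup is the same idea as in the dplyr version, just done once for all rows instead of once per group.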

Upvotes: 1
