Reputation: 2644

R data.table: replace missing values by group by value depending on number of missing values in group

I would like to replace missing values in my data.table for each group and fill in value depending on whether all values in group are missing, or just some values in group are missing.

I can solve the problem but am open to a better code(in terms of speed/memory/readability/flexibility).

I'm stubborn and i'd prefer the data.table solution :)

example:

Starting dataset:

it is a data.table with this structure:

dt = data.table(
  grouping_1 = sort(rep(c('a', 'b', 'c'), 4)),
  grouping_2 = c(1,1,2,2,1,1,2,2,1,1,2,2),
  value_1 = c(NA, NA, NA, NA, NA, 1, 2, NA, 3, 2,4,NA),
  value_2 = c(NA, 2, NA, NA, 2, 5, 2, 7, 10, 5,NA, NA)
)

That looks like this:

    grouping_1 grouping_2 value_1 value_2
 1:          a          1      NA      NA
 2:          a          1      NA       2
 3:          a          2      NA      NA
 4:          a          2      NA      NA
 5:          b          1      NA       2
 6:          b          1       1       5
 7:          b          2       2       2
 8:          b          2      NA       7
 9:          c          1       3      10
10:          c          1       2       5
11:          c          2       4      NA
12:          c          2      NA      NA

What i want to do with is:

I want to group it by columns grouping_1 and grouping_2 and replace missing values in in columns value_1 and value_2.

If there is no non-missing value for given group (e.g. group grrouping_1==a & grouping_2==1), i want to replace all NAs for this group by the value of 9000.

If there are some non-missing values for given group, i want to replace missing values by 800 if grouping_2==1 and replace by -800 (negative 800) if grouping_2==2. If the value is not missing, i do not want to change it.

How can i do it now:

I wrote the following function, which i then apply to each column where i want to fill in missing values. The function changes the original dataset by reference:

filler_so = function(
  data, # the dataset that we will be changing
  column, # the column we will be filling in
  placeholder_col ='drop_at_the_end', # some temporary column that will disappear in the end
  missing_fully = 9000, # value to fill in when all values in group missing
  missing_partially_g2_1 = 800, # value to fill when grouping_2 = 1
  missing_partially_g2_2 = -800, # value to fill when grouping_2 = 2
  g2_col = 'grouping_2', # name of column corresponding to grouping_2 from my example
  group_cols = c('grouping_1', 'grouping_2') # names of columns to group by
  ){
  
  # identify for given column whether all values in group are missing,
  # or only some are misisng. The value will be either Infinity (all missig),
  # or a real number (none or some missing).
  # this info is put in a placeholder column
  data[, (placeholder_col) := min(get(column), na.rm = T), by = group_cols]
  
  # if value on a given row is missing, but not all missing in group,
  # then fill in the values based on what group is in 2nd grouping column
  data[
    is.na(get(column)) & (get(placeholder_col) != Inf),
    (placeholder_col) := (get(g2_col) == 2) * missing_partially_g2_2 +
      (get(g2_col) ==1) * missing_partially_g2_1]
    
  # if all values in group are missing, fill in the "missing_fully" value
  data[get(placeholder_col) == Inf, (placeholder_col) := missing_fully]
  
  # put into placeholder column the values that were originally not missing
  data[!is.na(get(column)), (placeholder_col) := get(column)]
  # drop the original column
  data[, (column):=NULL]
  # rename the placeholder column to the name of original column
  setnames(data, placeholder_col, column)
  
  # if i don't put this here,
  # then sometimes the function doesn't return results properly.
  # i have no clue why.
  data
}

To apply this function i need to identify the columns to be filled, which i do like this:

cols_to_fill = colnames(dt)[grep('^value', colnames(dt))]

And lapply like so:

lapply(cols_to_fill, function(x) filler_so(dt, x))

The result:

> dt
    grouping_1 grouping_2 value_1 value_2
 1:          a          1    9000     800
 2:          a          1    9000       2
 3:          a          2    9000    9000
 4:          a          2    9000    9000
 5:          b          1     800       2
 6:          b          1       1       5
 7:          b          2       2       2
 8:          b          2    -800       7
 9:          c          1       3      10
10:          c          1       2       5
11:          c          2       4    9000
12:          c          2    -800    9000

What i would like to improve:

My function works but is quite verbose, i was hoping i could turn the code into less lines
The function if not very flexible - it would be better to pass something like a named vector to specify the replacing logic and values to fill based on grouping_2
I'm looking for speed and memory gains. (for instance, there might be faster way how to identify groups with all values missing, then to run `min(..., na.rm = TRUE) and then check when it is Infinity.
lapply prints out the changed data.table for each column i'm filling in, which spams the console quite a bit.
even if the problem at 4 is fixed, i would like to know if there is a way how instead of lapply i could do this using something like dt[..., (some_column_names) := lapply(.SD, ...), .SDcols = cols_to_fill]
i am open to any other improvements i haven't thought of.

Upvotes: 2

Answers (2)

Ronak Shah

Reputation: 389235

This is still verbose but uses .SDcols :

library(data.table)
cols <- grep('^value', colnames(dt), value = TRUE)

dt[, (cols) := lapply(.SD, function(x) {
        #Check NA values once
        tmp <- is.na(x)
        #If no non-NA value
        if(all(tmp)) return(9000)
        #If some missing values
        if(any(tmp)) {
         #If grouping2 is 1
         if(first(grouping_2) == 1) 
           replace(x, tmp, 800)
         else 
           replace(x, tmp, -800)
         }
        else x
  }), .(grouping_1, grouping_2), .SDcols = cols]


dt
#    grouping_1 grouping_2 value_1 value_2
# 1:          a          1    9000     800
# 2:          a          1    9000       2
# 3:          a          2    9000    9000
# 4:          a          2    9000    9000
# 5:          b          1     800       2
# 6:          b          1       1       5
# 7:          b          2       2       2
# 8:          b          2    -800       7
# 9:          c          1       3      10
#10:          c          1       2       5
#11:          c          2       4    9000
#12:          c          2    -800    9000

Upvotes: 1

Waldi

Reputation: 41260

Try:

replace_NA <- function(v,grouping_2) {
  na_v = is.na(v)
  if (sum(na_v) == length(v)) {
    return(rep(9000,length(v)))
  } else {
    v[na_v] <- ifelse(grouping_2 == 1, 800,-800)
    return(v)
  }
}
  
dt[, c("v1_new","v2new") :=.( replace_NA(value_1,grouping_2),
                              replace_NA(value_2,grouping_2))
                              ,by=.(grouping_1,grouping_2)]

Upvotes: 3