Reputation: 2644
I would like to replace missing values in my data.table for each group and fill in value depending on whether all values in group are missing, or just some values in group are missing.
I can solve the problem but am open to a better code(in terms of speed/memory/readability/flexibility).
I'm stubborn and i'd prefer the data.table solution :)
it is a data.table with this structure:
dt = data.table(
grouping_1 = sort(rep(c('a', 'b', 'c'), 4)),
grouping_2 = c(1,1,2,2,1,1,2,2,1,1,2,2),
value_1 = c(NA, NA, NA, NA, NA, 1, 2, NA, 3, 2,4,NA),
value_2 = c(NA, 2, NA, NA, 2, 5, 2, 7, 10, 5,NA, NA)
)
That looks like this:
grouping_1 grouping_2 value_1 value_2
1: a 1 NA NA
2: a 1 NA 2
3: a 2 NA NA
4: a 2 NA NA
5: b 1 NA 2
6: b 1 1 5
7: b 2 2 2
8: b 2 NA 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 NA
12: c 2 NA NA
I want to group it by columns grouping_1
and grouping_2
and replace missing values in in columns value_1
and value_2
.
If there is no non-missing value for given group (e.g. group grrouping_1==a & grouping_2==1
), i want to replace all NAs for this group by the value of 9000.
If there are some non-missing values for given group, i want to replace missing values by 800 if grouping_2==1
and replace by -800 (negative 800) if grouping_2==2
. If the value is not missing, i do not want to change it.
I wrote the following function, which i then apply to each column where i want to fill in missing values. The function changes the original dataset by reference:
filler_so = function(
data, # the dataset that we will be changing
column, # the column we will be filling in
placeholder_col ='drop_at_the_end', # some temporary column that will disappear in the end
missing_fully = 9000, # value to fill in when all values in group missing
missing_partially_g2_1 = 800, # value to fill when grouping_2 = 1
missing_partially_g2_2 = -800, # value to fill when grouping_2 = 2
g2_col = 'grouping_2', # name of column corresponding to grouping_2 from my example
group_cols = c('grouping_1', 'grouping_2') # names of columns to group by
){
# identify for given column whether all values in group are missing,
# or only some are misisng. The value will be either Infinity (all missig),
# or a real number (none or some missing).
# this info is put in a placeholder column
data[, (placeholder_col) := min(get(column), na.rm = T), by = group_cols]
# if value on a given row is missing, but not all missing in group,
# then fill in the values based on what group is in 2nd grouping column
data[
is.na(get(column)) & (get(placeholder_col) != Inf),
(placeholder_col) := (get(g2_col) == 2) * missing_partially_g2_2 +
(get(g2_col) ==1) * missing_partially_g2_1]
# if all values in group are missing, fill in the "missing_fully" value
data[get(placeholder_col) == Inf, (placeholder_col) := missing_fully]
# put into placeholder column the values that were originally not missing
data[!is.na(get(column)), (placeholder_col) := get(column)]
# drop the original column
data[, (column):=NULL]
# rename the placeholder column to the name of original column
setnames(data, placeholder_col, column)
# if i don't put this here,
# then sometimes the function doesn't return results properly.
# i have no clue why.
data
}
To apply this function i need to identify the columns to be filled, which i do like this:
cols_to_fill = colnames(dt)[grep('^value', colnames(dt))]
And lapply like so:
lapply(cols_to_fill, function(x) filler_so(dt, x))
> dt
grouping_1 grouping_2 value_1 value_2
1: a 1 9000 800
2: a 1 9000 2
3: a 2 9000 9000
4: a 2 9000 9000
5: b 1 800 2
6: b 1 1 5
7: b 2 2 2
8: b 2 -800 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 9000
12: c 2 -800 9000
grouping_2
dt[..., (some_column_names) := lapply(.SD, ...), .SDcols = cols_to_fill]
Upvotes: 2
Views: 684
Reputation: 389235
This is still verbose but uses .SDcols
:
library(data.table)
cols <- grep('^value', colnames(dt), value = TRUE)
dt[, (cols) := lapply(.SD, function(x) {
#Check NA values once
tmp <- is.na(x)
#If no non-NA value
if(all(tmp)) return(9000)
#If some missing values
if(any(tmp)) {
#If grouping2 is 1
if(first(grouping_2) == 1)
replace(x, tmp, 800)
else
replace(x, tmp, -800)
}
else x
}), .(grouping_1, grouping_2), .SDcols = cols]
dt
# grouping_1 grouping_2 value_1 value_2
# 1: a 1 9000 800
# 2: a 1 9000 2
# 3: a 2 9000 9000
# 4: a 2 9000 9000
# 5: b 1 800 2
# 6: b 1 1 5
# 7: b 2 2 2
# 8: b 2 -800 7
# 9: c 1 3 10
#10: c 1 2 5
#11: c 2 4 9000
#12: c 2 -800 9000
Upvotes: 1
Reputation: 41260
Try:
replace_NA <- function(v,grouping_2) {
na_v = is.na(v)
if (sum(na_v) == length(v)) {
return(rep(9000,length(v)))
} else {
v[na_v] <- ifelse(grouping_2 == 1, 800,-800)
return(v)
}
}
dt[, c("v1_new","v2new") :=.( replace_NA(value_1,grouping_2),
replace_NA(value_2,grouping_2))
,by=.(grouping_1,grouping_2)]
Upvotes: 3