chipsin
chipsin

Reputation: 675

Automating coalescing of data

I am regularly faced with data frames which have a number of columns which need to be coalesced into a single column. The order of coalescing is based on the number of valid entries, with the variable with the most valid entries getting precedence. I have written code to provide a count of valid entries for each variable, and based on this output I manually define the order that the variables are to be coalesced. I was wondering if there is a way to automate this step so that I do not need to manually type the order to coalesce? Below is an example dataset.

df <- data.frame(1:16)
df$A <- c(10,0,0,0,0,0,12,12,0,14,NA_real_,14,NA_real_,NA_real_,16,16)
df$B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

 A  B  C
10 10 10
 0  0 12
 0  0 14
 0  0 16
 0 12 10
 0 12 12
12 12 14
12 12 16
 0  0 10
14 14 12
NA NA 14
14 14 16
NA 16 10
NA 16 12
16 16 14
16 16 16

I have been determining which variable has the most valid entries using the code below (NA values and 0 are treated the same):

sum(df$A > 0, na.rm = TRUE)
   #7
sum(df$B > 0, na.rm = TRUE)
   #11 
sum(df$C > 0, na.rm = TRUE)
   #16

To coalesce the data I have used the code:

df %>% mutate(new_value = coalesce(C,B,A))

A   B   C  new_value
10  10  10        10
0   0   12        12
0   0   14        14
0   0   16        16
0   12  10        10
0   12  12        12
12  12  14        14
12  12  16        16
0   0   10        10
14  14  12        12
NA  NA  14        14
14  14  16        16
NA  16  10        10
NA  16  12        12
16  16  14        14
16  16  16        16

Upvotes: 3

Views: 55

Answers (2)

tmfmnk
tmfmnk

Reputation: 40171

One dplyr possibility could be:

df %>%
 mutate(new_value = pull(select(., which.max(across(everything(), ~ sum(. > 0, na.rm = TRUE))))))

    A  B  C new_value
1  10 10 10        10
2   0  0 12        12
3   0  0 14        14
4   0  0 16        16
5   0 12 10        10
6   0 12 12        12
7  12 12 14        14
8  12 12 16        16
9   0  0 10        10
10 14 14 12        12
11 NA NA 14        14
12 14 14 16        16
13 NA 16 10        10
14 NA 16 12        12
15 16 16 14        14
16 16 16 16        16

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389235

You can use colSums to get count of number of values greater than 0 in df, sort them in decreasing order and store the column names (cols). Subset the data based on cols and use coalesce with do.call.

cols <- names(sort(colSums(df[-1] > 0, na.rm  =TRUE), decreasing = TRUE))
df$new_value <- do.call(dplyr::coalesce, df[cols])

We use -1 to exclude the first column which has values 1:16. This will work for any number of columns.

Upvotes: 3

Related Questions