Brent
Brent

Reputation: 465

How to create multiple new columns based on multiple conditional statements?

[ First Stack question please be kind :) ]

I'm creating multiple new columns in a data frame based on multiple conditional statements of existing columns - all essentially new combinations of columns.

For example, if there are 4 columns (a:d), I need new columns of all combinations (abcd, abc, abd, etc) and a 0/1 coding based on threshold data in a:d.

Toy data example included and desired outcome. However needs to be scalable: there are 4 base columns, but I need all combinations of 2, 3 and 4 columns not just 3-value (abc, abd, .... ab, ac, ad, ... total n = 11)

[Background for context: this is actually flow cytometry data from multipotent stem cells that can grow into colonies of all lineage cell type (multipotent, or abcd) or progressively more restricted populations (only abc, or abd, ab, ac, etc)

# Toy data set
set.seed(123)
df <- tibble(a = c(sample(10:50, 10)),
             b = c(sample(10:50, 10)),
             c = c(sample(10:50, 10)),
             d = c(sample(10:50, 10)))

Current code produces the desired result however, this needs 11 lines of repetitive code which is error prone and I hope has a more elegant solution:

df %>% 
  mutate(
    abcd = if_else(a > 30 & b > 20 & c > 30 & d > 30, 1, 0),
    abc = if_else(a > 30 & b > 20 & c > 30 & d <= 30, 1, 0),
    abd = if_else(a > 30 & b > 20 & c <= 30 & d > 30, 1, 0),
    acd = if_else(a > 30 & b <= 20 & c > 30 & d > 30, 1, 0),
    bcd = if_else(a <= 30 & b > 20 & c > 30 & d > 30, 1, 0))

Upvotes: 1

Views: 120

Answers (2)

nsinghphd
nsinghphd

Reputation: 2022

What I understand from your question, for each row you just need to find which columns meet the criteria defined in your ifelse() conditions. This vectorized solution will add a column to your df which contains all the combinations. This probably is also faster than multiple ifelse conditions as well. Finally, the new column can be used for ordering or grouping.

# define the threshold levels for all columns
threshold = c(a=30, b=20, c=30, d=30)

# get names of columns meeting the threshold and paste names
df$combn <- apply(df, 1, function(x) {
  paste(names(x)[x > threshold], collapse = "")
})

> df
# A tibble: 10 x 5
       a     b     c     d combn
   <int> <int> <int> <int> <chr>
 1    21    49    46    49 bcd  
 2    41    28    37    46 abcd 
 3    25    36    34    36 bcd  
 4    43    31    47    40 abcd 
 5    44    13    48    10 ac   
 6    11    42    35    27 bc   
 7    28    18    29    48 d    
 8    40    11    30    17 a    
 9    46    20    19    20 a    
10    24    40    14    43 bd   

Upvotes: 2

liborm
liborm

Reputation: 2724

If I get that correctly, you want to categorize each row into exactly one class, so getting the category name as concatenation of threshold tests should be enough. Then you can get 0/1 columns using spread():

df %>% 
  mutate(
    a_ = if_else(a > 30, 'a', 'x'),
    b_ = if_else(b > 20, 'b', 'x'),
    c_ = if_else(c > 30, 'c', 'x'),
    d_ = if_else(d > 30, 'd', 'x'),
    all_ = paste0(a_, b_, c_, d_),
    one_ = 1) %>% 
  spread(all_, one_, fill = 0) %>% 
  select(-ends_with("_"))

Gives

# A tibble: 10 x 11
       a     b     c     d  abcd  axcx  axxx  xbcd  xbcx  xbxd  xxxd
   <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    11    42    35    27     0     0     0     0     1     0     0
 2    21    49    46    49     0     0     0     1     0     0     0
 3    24    40    14    43     0     0     0     0     0     1     0
 4    25    36    34    36     0     0     0     1     0     0     0
 5    28    18    29    48     0     0     0     0     0     0     1
 6    40    11    30    17     0     0     1     0     0     0     0
 7    41    28    37    46     1     0     0     0     0     0     0
 8    43    31    47    40     1     0     0     0     0     0     0
 9    44    13    48    10     0     1     0     0     0     0     0
10    46    20    19    20     0     0     1     0     0     0     0

(You can use '' instead of 'x', but then spread() will overwrite some of your original columns.)

Upvotes: 0

Related Questions