D Greenwood
D Greenwood

Reputation: 436

Normalise all numeric columns in a dataframe by relevant control group with R and dplyr

I have a large data set with multiple observations of different conditions:

Condition = c('A', 'A', 'B', 'B', 'C', 'C', 'D', 'D')
Control = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
Value_1 = 1:4
Value_2 = 2 * 1:4
Value_3 = 3 * 1:4

t = data.frame(Condition, Control, Value_1, Value_2, Value_3)

  Condition Control Value_1 Value_2 Value_3
1         A       A       1       2       3
2         A       A       2       4       6
3         B       A       3       6       9
4         B       A       4       8      12
5         C       B       1       2       3
6         C       B       2       4       6
7         D       B       3       6       9
8         D       B       4       8      12

I want to divide each of the value columns by the mean of the values of their specified control group. The desired output is:

  Condition Control Value_1 Value_2 Value_3
  <chr>     <chr>     <dbl>   <dbl>   <dbl>
1 A         A         0.667   0.667   0.667
2 A         A         1.33    1.33    1.33 
3 B         A         0.857   0.857   0.857
4 B         A         1.14    1.14    1.14 
5 C         C         0.667   0.667   0.667
6 C         C         1.33    1.33    1.33 
7 D         C         0.857   0.857   0.857
8 D         C         1.14    1.14    1.14 

If I only had one control group and specified the columns then I would use:

t %>% group_by(Control) %>%
   mutate(Value_1 = Value_1/Value_1[Condition == 'A'])

However here this will not work for several reasons: I have multiple values for each control group that need to be meaned first. I also have multiple controls, and need to specify that the relevant control for each row is the one specified for that row (not just A in every case). I also want to apply the normalisation to every numerical column. I know that mutate_if(is.numeric, .fun) can be used to select numeric columns, but I do not know how it would be possible to write a generic function to perform the normalisation to a control group.

With this small dataset it would be easiest to just split it into each control group, and to specify the mutations by naming each colony manually. However I am looking for a solution that can handle larger datasets with arbitrary numbers of variables and control groups.

Upvotes: 0

Views: 509

Answers (1)

IceCreamToucan
IceCreamToucan

Reputation: 28675

You can make a table of the means for each condition, then left join that to your data. Now that you have the means and the values in the same table, you just need to transmute to do the division.

cond_means <- 
  t %>% 
    group_by(Condition) %>% 
    summarise_if(is.numeric, mean)

t %>% 
  left_join(cond_means, by = c(Control = 'Condition')) %>% 
  transmute(Condition, 
            Control,
            Value_1 = Value_1.x/Value_1.y,
            Value_2 = Value_2.x/Value_2.y,
            Value_3 = Value_3.x/Value_3.y)

#   Condition Control   Value_1   Value_2   Value_3
# 1         A       A 0.6666667 0.6666667 0.6666667
# 2         A       A 1.3333333 1.3333333 1.3333333
# 3         B       A 2.0000000 2.0000000 2.0000000
# 4         B       A 2.6666667 2.6666667 2.6666667
# 5         C       B 0.2857143 0.2857143 0.2857143
# 6         C       B 0.5714286 0.5714286 0.5714286
# 7         D       B 0.8571429 0.8571429 0.8571429
# 8         D       B 1.1428571 1.1428571 1.1428571

One change I made was to use stringsAsFactors = FALSE when creating the table, because factors are a pain to work with.

t = data.frame(Condition, Control, Value_1, Value_2, Value_3, 
               stringsAsFactors = FALSE)

Upvotes: 2

Related Questions