karnowski
karnowski

Reputation: 93

How to normalise subgroups from a grouped data frame in R

I have a data frame with two numerical variables fatcontent and saltcontent plus two factor variables cond and spice that describe the different treatments. In this data frame each measurement for the numerical varibles was taken twice.

a <- data.frame(cond = rep(c("uncooked", "fried", "steamed", "baked", "grilled"),
                       each = 2, times = 3),
                spice = rep(c("none", "chilli", "basil"), each = 10),
                fatcontent = c(4, 5, 6828, 7530, 6910, 7132, 5885, 613, 2845, 2867,
                               25, 18, 2385, 33227, 4233, 4023, 953, 1025, 4465, 5016,
                               5, 5, 10235, 12545, 5511, 5111, 596, 585, 4012, 3633),
                saltcontent = c(2, 5, 4733, 5500, 5724, 15885, 14885, 217, 193, 148,
                                6, 4, 26738, 24738, 22738, 23738, 267, 256, 1121, 1558,
                                1, 1, 21738, 20738, 26738, 27738, 195, 202, 129, 131)
                )

Now, I wish to nomalise (that means divide in this case) the numerical variables for each spice group by the mean of the uncooked condition.
E.g. for a$spice == "none"

       cond  spice fatcontent saltcontent  
1  uncooked   none          4           2  
2  uncooked   none          5           5  
3     fried   none       6828        4733  
4     fried   none       7530        5500  
5   steamed   none       6910        5724  
6   steamed   none       7132       15885  
7     baked   none       5885       14885  
8     baked   none        613         217  
9   grilled   none       2845         193  
10  grilled   none       2867         148   

After normalisation:

       cond spice   fatcontent  saltcontent
1  uncooked  none    0.8888889    0.5714286
2  uncooked  none    1.1111111    1.4285714
3     fried  none 1517.3333333 1352.2857143
4     fried  none 1673.3333333 1571.4285714
5   steamed  none 1535.5555556 1635.4285714
6   steamed  none 1584.8888889 4538.5714286
7     baked  none 1307.7777778 4252.8571429
8     baked  none  136.2222222   62.0000000
9   grilled  none  632.2222222   55.1428571
10  grilled  none  637.1111111   42.2857143

My questions is how can I do this for all the groups and variables in the data frame? I assume I could use the dplyr package but I am not sure what is the best way. I appreciate any help!

Upvotes: 8

Views: 4015

Answers (3)

talat
talat

Reputation: 70256

A succinct way to normalize the data would be to include the "uncooked" condition right in the mean calculation so you don't need to filter, summarise, join and recalculate. Doing this with mutate_each means you only need to type it once.

group_by(a, spice) %>%
  mutate_each(funs(./mean(.[cond == "uncooked"])), -cond)

#Source: local data frame [30 x 4]
#Groups: spice
#
#       cond  spice   fatcontent  saltcontent
#1  uncooked   none    0.8888889 5.714286e-01
#2  uncooked   none    1.1111111 1.428571e+00
#3     fried   none 1517.3333333 1.352286e+03
#4     fried   none 1673.3333333 1.571429e+03
#5   steamed   none 1535.5555556 1.635429e+03
#6   steamed   none 1584.8888889 4.538571e+03
#7     baked   none 1307.7777778 4.252857e+03
#8     baked   none  136.2222222 6.200000e+01
#9   grilled   none  632.2222222 5.514286e+01
#10  grilled   none  637.1111111 4.228571e+01
# ... etc

Upvotes: 5

jazzurro
jazzurro

Reputation: 23574

I think this is what you are after. You want to find mean for each spice condition using uncooked data points. That is something I have done in my first step. Then, I wanted to add fatmean and saltmean in ana to your data frame, a. If your data is really huge, this may not be a memory efficient way. But, I used left_join to merge ana and a. I, then, did division in mutate for each spice condition. Finally, I dropped two columns for tidying up the results using select.

### Find mean for each spice condition using uncooked data points                
ana <- group_by(filter(a, cond == "uncooked"), spice) %>%
       summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent)) 

 #   spice fatmean saltmean
 #1  basil     5.0      1.0
 #2 chilli    21.5      5.0
 #3   none     4.5      3.5

left_join(a, ana, by = "spice") %>%
group_by(spice) %>%
mutate(fatcontent = fatcontent / fatmean,
       saltcontent = saltcontent / saltmean) %>%
select(-c(fatmean, saltmean))

# A part of the results
#       cond spice   fatcontent  saltcontent
#1  uncooked  none    0.8888889    0.5714286
#2  uncooked  none    1.1111111    1.4285714
#3     fried  none 1517.3333333 1352.2857143
#4     fried  none 1673.3333333 1571.4285714
#5   steamed  none 1535.5555556 1635.4285714
#6   steamed  none 1584.8888889 4538.5714286
#7     baked  none 1307.7777778 4252.8571429
#8     baked  none  136.2222222   62.0000000
#9   grilled  none  632.2222222   55.1428571
#10  grilled  none  637.1111111   42.2857143

If you do all things in one piping, it would be something like this:

group_by(filter(a, cond == "uncooked"), spice) %>%
    summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent)) %>%
    left_join(a, ., by = "spice") %>% #right_join is possible with the dev dplyr
    group_by(spice) %>%
    mutate(fatcontent = fatcontent / fatmean,
           saltcontent = saltcontent / saltmean) %>%
    select(-c(fatmean, saltmean))

Upvotes: 4

Andrew
Andrew

Reputation: 38619

All you need to do is group by both the condition and spice, like so:

library(dplyr)
a %>% group_by(spice, cond) %>%
  mutate(fat.norm = fatcontent / mean(fatcontent),
         salt.norm = saltcontent / mean(saltcontent))

# Source: local data frame [90 x 6]
# Groups: spice, cond
# 
#        cond  spice fatcontent saltcontent  fat.norm  salt.norm
# 1  uncooked   none          4           2 0.8888889 0.57142857
# 2  uncooked   none          5           5 1.1111111 1.42857143
# 3     fried   none       6828        4733 0.9511074 0.92504642
# 4     fried   none       7530        5500 1.0488926 1.07495358
# 5   steamed   none       6910        5724 0.9841903 0.52977926
# 6   steamed   none       7132       15885 1.0158097 1.47022074
# 7     baked   none       5885       14885 1.8113266 1.97126208
# 8     baked   none        613         217 0.1886734 0.02873792
# 9   grilled   none       2845         193 0.9961485 1.13196481
# 10  grilled   none       2867         148 1.0038515 0.86803519

Alternatively, if you don't want to specify each column, you can use mutate_each or summarise_each:

group.norm <- function(x) {
  x / mean(x)
}

a %>% group_by(spice, cond) %>%
  mutate_each(funs(group.norm))

You can exclude columns or specify only specific columns in mutate_each() as well, like mutate_each(funs(group.norm), -notthisone) or mutate_each(funs(group.norm), onlythisone)

Upvotes: 2

Related Questions