Lou_A
Lou_A

Reputation: 259

How do I aggregate a dataframe and sum the values of a column by repeated rows in r

I'm attempting to aggregate a dataframe to remove repeated rows, but I need to sum the value of a count column and use it as the new count for that row value. I have the following dataframe:

  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

I need to get to this:

   count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
   8562    0.048822 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    778    0.003332 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Instead, I'm getting this:

      count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
        3    0.601110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
        2    0.506717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
        1    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Here's piece of the code that's not working right:

  agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + 
             DEnd +   JStart, data = final_df, 
                FUN = length), freq <- count/sum(count))


  agg_df1 <-select(agg_df, count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) 

Instead of adding the "count" value of the corresponding repeated rows, so I can recalculate the frequency, it's basically counting the number of times the particular row is repeated. Any thoughts are greatly appreciated. Thanks.

Upvotes: 0

Views: 62

Answers (1)

Len Greski
Len Greski

Reputation: 10855

The use of FUN = length causes the output value of count to be the number of occurrences for each of the by groups. Instead, use FUN = sum to calculate the sum of the input count column.

textFile <- "  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31"

final_df <- read.table(text = textFile,
                   header = TRUE)

# original code had FUN = length, which returned the number of rows per
# combination of by groups 
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + 
          DEnd +   JStart, data = final_df, FUN = sum), freq <- count/sum(count))
agg_df

...and the output:

> agg_df
   cdr3nt cdr3aa        v     d       j VEnd DStart DEnd JStart count       freq
1 TGCGCCA  CASMG TRBV10-2 TRBD1 TRBJ1-1    9     15   19     20   778 0.08062176
2 TGTGCCA  CASSE  TRBV6-1 TRBD1 TRBJ2-6   13     18   22     24  8562 0.88725389
3 TGCAGTG  CSARD TRBV20-1 TRBD1 TRBJ1-5   15     17   23     31   310 0.03212435
> 

We can confirm accuracy of the freq column as follows:

> # confirm accuracy 
> agg_df$count / sum(agg_df$count)
[1] 0.08062176 0.88725389 0.03212435
> 

Upvotes: 1

Related Questions