Reputation: 259
I'm attempting to aggregate a dataframe to remove repeated rows, but I need to sum the value of a count column and use it as the new count for that row value. I have the following dataframe:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3154 0.036110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2800 0.038394 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2608 0.033014 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
412 0.004717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
366 0.005015 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
I need to get to this:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
8562 0.048822 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
778 0.003332 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
Instead, I'm getting this:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3 0.601110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2 0.506717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
1 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
Here's piece of the code that's not working right:
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart +
DEnd + JStart, data = final_df,
FUN = length), freq <- count/sum(count))
agg_df1 <-select(agg_df, count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart)
Instead of adding the "count" value of the corresponding repeated rows, so I can recalculate the frequency, it's basically counting the number of times the particular row is repeated. Any thoughts are greatly appreciated. Thanks.
Upvotes: 0
Views: 62
Reputation: 10855
The use of FUN = length
causes the output value of count
to be the number of occurrences for each of the by groups. Instead, use FUN = sum
to calculate the sum of the input count
column.
textFile <- " count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3154 0.036110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2800 0.038394 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2608 0.033014 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
412 0.004717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
366 0.005015 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31"
final_df <- read.table(text = textFile,
header = TRUE)
# original code had FUN = length, which returned the number of rows per
# combination of by groups
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart +
DEnd + JStart, data = final_df, FUN = sum), freq <- count/sum(count))
agg_df
...and the output:
> agg_df
cdr3nt cdr3aa v d j VEnd DStart DEnd JStart count freq
1 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20 778 0.08062176
2 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24 8562 0.88725389
3 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31 310 0.03212435
>
We can confirm accuracy of the freq
column as follows:
> # confirm accuracy
> agg_df$count / sum(agg_df$count)
[1] 0.08062176 0.88725389 0.03212435
>
Upvotes: 1