Lou_A
Lou_A

Reputation: 259

Removing repeated rows (by several columns) and recalculating count and frequency values in R

I have a large data for which I'm attempting to remove repeated row entries based on several columns. The column headings and sample entries are

 count  freq,   cdr3nt,       cdr3aa,    v,      d,    j,     VEnd, DStart, DEnd, JStart
 5036   0.0599  TGCAGTGCTAGAG CSARDPDR TRBV20-1 TRBD1 TRBJ1-5  15     17     43    21

There are several thousand rows, and for two rows to match all the values except for "count" and "freq" must be the same. I want to remove the repeated entries, but before that, I need to change the "count" value of the one repeated row with the sum of the individual repeated row "count" to reflect the true abundance. Then, I need to recalculate the frequency of the new "count" based on the sum of all the counts of the entire table.

For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.

Here's my script.

library(dplyr)

# Input sample replicate table.
  dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")

# combine rows with identical data.  Recalculation of frequency values.
 dta %>% mutate(total = sum(count)) %>%
    group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
    summarize(count_new = sum(count), freq = count_new/mean(total))

 dta_clean <- dta

Any help is greatly appreciated. Here's a screenshot of how the datatable looks like. screenshot

Upvotes: 0

Views: 53

Answers (2)

linog
linog

Reputation: 6226

Preliminary step: transform in data.table and store column names that are not count and freq

library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]

(in your example, count and freq are in the first two positions)

To recompute count and freq:

df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]

If you want to keep unique values by all columns except count and freq

df_unique <- unique(df, by = cols)

Upvotes: 1

r2evans
r2evans

Reputation: 160952

Sample data, where grp1 and grp2 are intended to be all of your grouping variables.

set.seed(42)
dat <- data.frame(
  grp1 = sample(1:2, size=20, replace=TRUE),
  grp2 = sample(3:4, size=20, replace=TRUE),
  count = sample(100, size=20, replace=TRUE),
  freq = runif(20)
)
head(dat)
#   grp1 grp2 count      freq
# 1    2    4    38 0.6756073
# 2    2    3    44 0.9828172
# 3    1    4     4 0.7595443
# 4    2    4    98 0.5664884
# 5    2    3    44 0.8496897
# 6    2    4    96 0.1894739

Code:

library(dplyr)
dat %>%
  group_by(grp1, grp2) %>%
  summarize(count = sum(count)) %>%
  ungroup() %>%
  mutate(freq = count / sum(count))
# # A tibble: 4 x 4
#    grp1  grp2 count   freq
#   <int> <int> <int>  <dbl>
# 1     1     3    22 0.0206
# 2     1     4   208 0.195 
# 3     2     3   383 0.358 
# 4     2     4   456 0.427 

Upvotes: 0

Related Questions