Removing repeated rows (by several columns) and recalculating count and frequency values in R

Question

I have a large data set from which I'm trying to remove repeated rows based on several columns. The column headings and a sample entry are:

 count  freq    cdr3nt         cdr3aa    v         d      j        VEnd  DStart  DEnd  JStart
 5036   0.0599  TGCAGTGCTAGAG  CSARDPDR  TRBV20-1  TRBD1  TRBJ1-5  15    17      43    21

There are several thousand rows. Two rows count as a match when all values except "count" and "freq" are the same. Before removing the repeated entries, I need to replace the "count" of the row that is kept with the sum of the "count" values of all its repeats, so that it reflects the true abundance. Then I need to recalculate "freq" from the new "count" by dividing it by the sum of all counts in the entire table.

For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.

Here's my script.

library(dplyr)

# Input sample replicate table.
  dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="	")

# combine rows with identical data.  Recalculation of frequency values.
 dta %>% mutate(total = sum(count)) %>%
    group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
    summarize(count_new = sum(count), freq = count_new/mean(total))

 dta_clean <- dta

Any help is greatly appreciated. Here's a screenshot of what the data table looks like.

linog · Accepted Answer

Preliminary step: convert the data to a data.table and store the names of all columns other than count and freq:

library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]

(In your example, count and freq are the first two columns, so the remaining columns start at position 3.)

To recompute count and freq:

df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]
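
To make the effect of these two lines concrete, here is a minimal sketch on a toy table. The data values and the reduced set of columns are made up for illustration and are not taken from your file; `setdiff()` is used instead of positional indexing only so the example does not depend on column order.

```r
library(data.table)

# Toy data (hypothetical): rows 1 and 2 are identical apart from count and freq.
df <- data.table(
  count  = c(3, 2, 4),
  freq   = c(3/9, 2/9, 4/9),
  cdr3nt = c("TGCAGT", "TGCAGT", "TGCGCC"),
  v      = c("TRBV20-1", "TRBV20-1", "TRBV7-2")
)

# All columns except count and freq define what "identical" means.
cols <- setdiff(colnames(df), c("count", "freq"))

# Sum the counts within each group of identical rows ...
df_agg <- df[, .(count = sum(count)), by = cols]
# ... then recompute freq against the total count of the whole table.
df_agg[, freq := count / sum(count)]

print(df_agg)
```

The two repeated rows collapse into one row whose count is their sum, and the recomputed freq column sums to 1 across the table.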

If you instead only want to keep the first row for each unique combination of all columns except count and freq (without summing the counts):

df_unique <- unique(df, by = cols)
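
As for why the script in the question seems to change nothing: the result of the dplyr pipeline is never assigned to anything, and `dta_clean <- dta` simply copies the original table. A minimal dplyr sketch of the same aggregation, assuming dplyr 1.0 or later and using made-up toy data with fewer columns than your table, could look like this:

```r
library(dplyr)

# Toy data (hypothetical): rows 1 and 2 are identical apart from count and freq.
dta <- data.frame(
  count  = c(3, 2, 4),
  freq   = c(3/9, 2/9, 4/9),
  cdr3nt = c("TGCAGT", "TGCAGT", "TGCGCC"),
  v      = c("TRBV20-1", "TRBV20-1", "TRBV7-2")
)

# Columns that must match for two rows to count as repeats.
key_cols <- setdiff(colnames(dta), c("count", "freq"))

# Assign the result of the pipeline; without the assignment nothing is kept.
dta_clean <- dta %>%
  group_by(across(all_of(key_cols))) %>%
  summarize(count = sum(count), .groups = "drop") %>%
  mutate(freq = count / sum(count))

print(dta_clean)
```

With your real table, `key_cols` would contain the nine columns from cdr3nt to JStart, and the key point is the assignment to `dta_clean`.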
