Reputation: 259
I have a large data for which I'm attempting to remove repeated row entries based on several columns. The column headings and sample entries are
count freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart
5036 0.0599 TGCAGTGCTAGAG CSARDPDR TRBV20-1 TRBD1 TRBJ1-5 15 17 43 21
There are several thousand rows, and for two rows to match all the values except for "count" and "freq" must be the same. I want to remove the repeated entries, but before that, I need to change the "count" value of the one repeated row with the sum of the individual repeated row "count" to reflect the true abundance. Then, I need to recalculate the frequency of the new "count" based on the sum of all the counts of the entire table.
For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.
Here's my script.
library(dplyr)
# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")
# combine rows with identical data. Recalculation of frequency values.
dta %>% mutate(total = sum(count)) %>%
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))
dta_clean <- dta
Any help is greatly appreciated. Here's a screenshot of how the datatable looks like.
Upvotes: 0
Views: 53
Reputation: 6226
Preliminary step: transform in data.table
and store column names that are not count
and freq
library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]
(in your example, count
and freq
are in the first two positions)
To recompute count and freq:
df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]
If you want to keep unique values by all columns except count
and freq
df_unique <- unique(df, by = cols)
Upvotes: 1
Reputation: 160952
Sample data, where grp1
and grp2
are intended to be all of your grouping variables.
set.seed(42)
dat <- data.frame(
grp1 = sample(1:2, size=20, replace=TRUE),
grp2 = sample(3:4, size=20, replace=TRUE),
count = sample(100, size=20, replace=TRUE),
freq = runif(20)
)
head(dat)
# grp1 grp2 count freq
# 1 2 4 38 0.6756073
# 2 2 3 44 0.9828172
# 3 1 4 4 0.7595443
# 4 2 4 98 0.5664884
# 5 2 3 44 0.8496897
# 6 2 4 96 0.1894739
Code:
library(dplyr)
dat %>%
group_by(grp1, grp2) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
mutate(freq = count / sum(count))
# # A tibble: 4 x 4
# grp1 grp2 count freq
# <int> <int> <int> <dbl>
# 1 1 3 22 0.0206
# 2 1 4 208 0.195
# 3 2 3 383 0.358
# 4 2 4 456 0.427
Upvotes: 0