Reputation: 25
I'm trying to count the occurrences of several different character strings in a data frame, based on multiple conditions. I have a data frame (mut.total) containing the following information:
   TYPE  Sample Genotype         Mutagen Dose
1   DUP CD0001c       N2              MA   20
2   DEL CD0001d       N2              MA   20
3   DUP CD0030a       N2              MA   20
4   DEL CD0035a       N2 Mechlorethamine   20
5   INV CD0035a       N2 Mechlorethamine   20
6   INV CD0035a       N2 Mechlorethamine   20
7   DUP CD0035a       N2 Mechlorethamine   20
8   DEL CD0035a       N2 Mechlorethamine   20
9   DEL CD0035c       N2 Mechlorethamine   20
10  DUP CD0035d       N2 Mechlorethamine   20
I want to produce a data frame showing, by mutagen and by type, the total number of mutations and the number of samples the mutations came from (keeping in mind that one sample may produce multiple mutations of the same type). An example of my intended output:
          Mutagen Type N.Mut N.Sample
1              MA  DEL     1        1
2 Mechlorethamine  DEL     3        2
3              MA  DUP     2        2
4 Mechlorethamine  DUP     2        2
5 Mechlorethamine  INV     2        1
Using aggregate I am able to generate the number of mutations by mutagen and by type, but I cannot figure out how to add the number of samples the mutations came from.
aggregate(x = mut.total$TYPE,
          by = list(Mutagen = mut.total$Mutagen, Type = mut.total$TYPE),
          FUN = length)
          Mutagen Type N.Mut
1              MA  DEL     1
2 Mechlorethamine  DEL     3
3              MA  DUP     2
4 Mechlorethamine  DUP     2
5 Mechlorethamine  INV     2
Upvotes: 2
Views: 319
Reputation: 887078
Using collapse (the sample data, mut.total, is defined below the call):
library(collapse)
collap(mut.total, ~ Mutagen + TYPE, custom = list(fNobs = 1, fNdistinct = 2 ))
mut.total <- structure(list(TYPE = c("DUP", "DEL", "DUP", "DEL", "INV", "INV",
"DUP", "DEL", "DEL", "DUP"), Sample = c("CD0001c", "CD0001d",
"CD0030a", "CD0035a", "CD0035a", "CD0035a", "CD0035a", "CD0035a",
"CD0035c", "CD0035d"), Genotype = c("N2", "N2", "N2", "N2", "N2",
"N2", "N2", "N2", "N2", "N2"), Mutagen = c("MA", "MA", "MA",
"Mechlorethamine", "Mechlorethamine", "Mechlorethamine", "Mechlorethamine",
"Mechlorethamine", "Mechlorethamine", "Mechlorethamine"), Dose = c(20L,
20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L)), row.names = c(NA,
-10L), class = "data.frame")
Upvotes: 1
Reputation: 30474
Here is a dplyr version:
library(dplyr)
mut.total %>%
  group_by(Mutagen, TYPE) %>%
  summarise(N.Mut = n(), N.Sample = n_distinct(Sample))
Output
  Mutagen         TYPE  N.Mut N.Sample
  <chr>           <chr> <int>    <int>
1 MA              DEL       1        1
2 MA              DUP       2        2
3 Mechlorethamine DEL       3        2
4 Mechlorethamine DUP       2        2
5 Mechlorethamine INV       2        1
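For readers who prefer to avoid extra packages, the same grouped summary can be sketched in base R. This is only an illustration, not part of the answer above: it assumes the mut.total data from the question (rebuilt here with just the columns it needs), and the helper names n_mut, n_sample and result are made up for the example. It runs aggregate() twice, once with length for the mutation count and once with length(unique()) for the distinct-sample count, then merges the two results on the grouping columns.

```r
# Sample data copied from the question (only the columns used here)
mut.total <- data.frame(
  TYPE    = c("DUP", "DEL", "DUP", "DEL", "INV", "INV", "DUP", "DEL", "DEL", "DUP"),
  Sample  = c("CD0001c", "CD0001d", "CD0030a", "CD0035a", "CD0035a",
              "CD0035a", "CD0035a", "CD0035a", "CD0035c", "CD0035d"),
  Mutagen = rep(c("MA", "Mechlorethamine"), times = c(3, 7)),
  stringsAsFactors = FALSE
)

# Number of mutations (rows) per Mutagen/TYPE group
n_mut <- aggregate(Sample ~ Mutagen + TYPE, data = mut.total, FUN = length)
names(n_mut)[3] <- "N.Mut"

# Number of distinct samples per Mutagen/TYPE group
n_sample <- aggregate(Sample ~ Mutagen + TYPE, data = mut.total,
                      FUN = function(s) length(unique(s)))
names(n_sample)[3] <- "N.Sample"

# Combine both counts; merge() sorts the rows by the key columns
result <- merge(n_mut, n_sample, by = c("Mutagen", "TYPE"))
print(result)
```

The two-pass approach keeps each aggregate() call simple; returning both counts from a single FUN is possible too, but it produces a matrix column that needs extra unpacking.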
Upvotes: 2
Reputation: 27732
A data.table approach:
library(data.table)
DT <- fread(" TYPE Sample Genotype Mutagen Dose
DUP CD0001c N2 MA 20
DEL CD0001d N2 MA 20
DUP CD0030a N2 MA 20
DEL CD0035a N2 Mechlorethamine 20
INV CD0035a N2 Mechlorethamine 20
INV CD0035a N2 Mechlorethamine 20
DUP CD0035a N2 Mechlorethamine 20
DEL CD0035a N2 Mechlorethamine 20
DEL CD0035c N2 Mechlorethamine 20
DUP CD0035d N2 Mechlorethamine 20")
DT[, .(N.Mut = .N,
       N.Sample = uniqueN(Sample)),
   by = .(Mutagen, TYPE)]
#            Mutagen TYPE N.Mut N.Sample
# 1:              MA  DUP     2        2
# 2:              MA  DEL     1        1
# 3: Mechlorethamine  DEL     3        2
# 4: Mechlorethamine  INV     2        1
# 5: Mechlorethamine  DUP     2        2
Upvotes: 1