Patrick Ortiz

Reputation: 25

Count character occurrences based on multiple conditions in R

I'm trying to count the occurrences of several character strings in a data frame, grouped by multiple conditions. My data frame (mut.total) looks like this:

   TYPE Sample    Genotype   Mutagen             Dose
1   DUP CD0001c   N2         MA                  20
2   DEL CD0001d   N2         MA                  20
3   DUP CD0030a   N2         MA                  20
4   DEL CD0035a   N2         Mechlorethamine     20
5   INV CD0035a   N2         Mechlorethamine     20
6   INV CD0035a   N2         Mechlorethamine     20
7   DUP CD0035a   N2         Mechlorethamine     20
8   DEL CD0035a   N2         Mechlorethamine     20
9   DEL CD0035c   N2         Mechlorethamine     20
10  DUP CD0035d   N2         Mechlorethamine     20

I want to produce a dataframe showing by mutagen and by type the total number of mutations and the number of samples the mutations came from (keeping in mind that one sample may produce multiple mutations of the same type). An example of my intended output:

          Mutagen Type N.Mut   N.Sample
1              MA  DEL 1       1
2 Mechlorethamine  DEL 3       2
3              MA  DUP 2       2
4 Mechlorethamine  DUP 2       2
5 Mechlorethamine  INV 2       1

Using aggregate I am able to generate the number of mutations by mutagen and by type, but I cannot figure out how to add the number of samples the mutations came from.

aggregate(x=mut.total$TYPE, by=list(Mutagen = mut.total$Mutagen, Type = mut.total$TYPE),
                         FUN = length)
          Mutagen Type N.Mut
1              MA  DEL 1
2 Mechlorethamine  DEL 3
3              MA  DUP 2
4 Mechlorethamine  DUP 2
5 Mechlorethamine  INV 2

Upvotes: 2

Views: 319

Answers (3)

akrun

Reputation: 887078

Using collapse

library(collapse)
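# Per Mutagen/TYPE group: count observations in column 1 (TYPE)
# and distinct values in column 2 (Sample).
# Note: recent collapse releases spell these functions fnobs and fndistinct.
collap(mut.total, ~ Mutagen + TYPE, custom = list(fNobs = 1, fNdistinct = 2 ))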

data

mut.total <- structure(list(TYPE = c("DUP", "DEL", "DUP", "DEL", "INV", "INV", 
"DUP", "DEL", "DEL", "DUP"), Sample = c("CD0001c", "CD0001d", 
"CD0030a", "CD0035a", "CD0035a", "CD0035a", "CD0035a", "CD0035a", 
"CD0035c", "CD0035d"), Genotype = c("N2", "N2", "N2", "N2", "N2", 
"N2", "N2", "N2", "N2", "N2"), Mutagen = c("MA", "MA", "MA", 
"Mechlorethamine", "Mechlorethamine", "Mechlorethamine", "Mechlorethamine", 
"Mechlorethamine", "Mechlorethamine", "Mechlorethamine"), Dose = c(20L, 
20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L)), row.names = c(NA, 
-10L), class = "data.frame")

Upvotes: 1

Ben

Reputation: 30474

Here is a dplyr version:

library(dplyr)

mut.total %>%
  group_by(Mutagen, TYPE) %>%
  summarise(N.Mut = n(), N.Sample = n_distinct(Sample))

Output

  Mutagen         TYPE  N.Mut N.Sample
  <chr>           <chr> <int>    <int>
1 MA              DEL       1        1
2 MA              DUP       2        2
3 Mechlorethamine DEL       3        2
4 Mechlorethamine DUP       2        2
5 Mechlorethamine INV       2        1
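
As a small variation, and only as a sketch, the pipeline above can drop the leftover grouping and sort the rows the way the question's intended output does (by type, then by mutagen). This assumes dplyr 1.0.0 or later for the `.groups` argument; the name `res` is just an illustrative placeholder, and the data frame is rebuilt here (without the Genotype and Dose columns) so the block runs on its own.

```r
library(dplyr)

# Rebuild the question's data so this sketch is self-contained.
mut.total <- data.frame(
  TYPE    = c("DUP", "DEL", "DUP", "DEL", "INV", "INV", "DUP", "DEL", "DEL", "DUP"),
  Sample  = c("CD0001c", "CD0001d", "CD0030a", "CD0035a", "CD0035a",
              "CD0035a", "CD0035a", "CD0035a", "CD0035c", "CD0035d"),
  Mutagen = rep(c("MA", "Mechlorethamine"), times = c(3, 7)),
  stringsAsFactors = FALSE
)

res <- mut.total %>%
  group_by(Mutagen, TYPE) %>%
  # n() counts rows per group; n_distinct() counts unique samples per group.
  # .groups = "drop" returns an ungrouped tibble.
  summarise(N.Mut = n(), N.Sample = n_distinct(Sample), .groups = "drop") %>%
  # Order rows by type first, then by mutagen, as in the question.
  arrange(TYPE, Mutagen)

res
```

The counts are the same as in the output above; only the row order and the grouping attribute differ.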

Upvotes: 2

Wimpel

Reputation: 27732

A data.table approach:

library(data.table)
DT <- fread("   TYPE Sample    Genotype   Mutagen             Dose
   DUP CD0001c   N2         MA                  20
   DEL CD0001d   N2         MA                  20
   DUP CD0030a   N2         MA                  20
   DEL CD0035a   N2         Mechlorethamine     20
   INV CD0035a   N2         Mechlorethamine     20
   INV CD0035a   N2         Mechlorethamine     20
   DUP CD0035a   N2         Mechlorethamine     20
   DEL CD0035a   N2         Mechlorethamine     20
   DEL CD0035c   N2         Mechlorethamine     20
   DUP CD0035d   N2         Mechlorethamine     20")

DT[, .(N.Mut    = .N, 
       N.Sample = uniqueN(Sample)),
   by = .(Mutagen, TYPE)]
#            Mutagen TYPE N.Mut N.Sample
# 1:              MA  DUP     2        2
# 2:              MA  DEL     1        1
# 3: Mechlorethamine  DEL     3        2
# 4: Mechlorethamine  INV     2        1
# 5: Mechlorethamine  DUP     2        2
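
For readers who would rather stay with aggregate, as in the question, the same two counts can also be sketched in base R: run aggregate twice (once with length, once counting unique samples) and merge the results. This is one possible approach, not the only one. The names `groups`, `n.mut`, `n.sample` and `result` are illustrative placeholders, and the data frame is rebuilt here (without the Genotype and Dose columns) so the block runs on its own. Note that aggregate names the summary column `x` when given a bare vector, so the sketch renames it.

```r
# Rebuild the question's data so this sketch is self-contained.
mut.total <- data.frame(
  TYPE    = c("DUP", "DEL", "DUP", "DEL", "INV", "INV", "DUP", "DEL", "DEL", "DUP"),
  Sample  = c("CD0001c", "CD0001d", "CD0030a", "CD0035a", "CD0035a",
              "CD0035a", "CD0035a", "CD0035a", "CD0035c", "CD0035d"),
  Mutagen = rep(c("MA", "Mechlorethamine"), times = c(3, 7)),
  stringsAsFactors = FALSE
)

# Grouping columns, shared by both aggregate() calls.
groups <- list(Mutagen = mut.total$Mutagen, Type = mut.total$TYPE)

# Number of mutations: one row per mutation, so count rows per group.
n.mut <- aggregate(x = mut.total$Sample, by = groups, FUN = length)
names(n.mut)[names(n.mut) == "x"] <- "N.Mut"

# Number of samples: count distinct sample IDs per group.
n.sample <- aggregate(x = mut.total$Sample, by = groups,
                      FUN = function(s) length(unique(s)))
names(n.sample)[names(n.sample) == "x"] <- "N.Sample"

# Combine both summaries on the grouping columns.
result <- merge(n.mut, n.sample, by = c("Mutagen", "Type"))
result
```

merge sorts by the key columns, so the rows come out ordered by Mutagen and then Type, with the same counts as the data.table result above.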

Upvotes: 1
