
Reputation: 51

Aggregating a categorical table in R (With percentages)

I have the following table in R:

Sample             Cluster  CellType  Condition  Genotype  Lane
Sample1            1        A         Mut        XXXX      1
Sample2            2        B         Mut        YYYY      1
Sample3            2        A         Mut        YYYY      2
Sample4            1        A         Mut        ZZZZ      1
Sample5            2        B         Mut        YYYY      3
Sample6            1        B         Mut        YYYY      1
Sample7            1        A         Mut        XXXX      2

I would like to:

Aggregate the table by the Cluster column,
so that every other column yields its dominant (most frequent) value within the cluster,
together with a "confidence level": the percentage of the cluster's rows that have that dominant value.

Like so:

Cluster      CellType  Condition  Genotype     Lane
1            A (75%)   Mut (100%) XXXX (50%)   1 (75%)
2            B (66%)   Mut (100%) YYYY (100%)  1 (33%)

I've tried the aggregate function as follows. It returns the dominant value per cluster, which is close, but the percentages are missing:

# Return the most frequent value of x (the first one found in case of a tie)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)

Upvotes: 3

Answers (2)

Prem

Reputation: 11955

library(dplyr)

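# funs() has been deprecated since dplyr 0.8.0; a named list of formula lambdas replaces it
df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane),
               list(val  = ~ names(which(table(.) == max(table(.)))[1]),
                    rate = ~ (max(table(.))[1] / n()) * 100))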

Output is:

  Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
1       1 A            Mut           XXXX         1                 75.0            100          50.0      75.0
2       2 B            Mut           YYYY         1                 66.7            100         100        33.3

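Or, to get the value and its percentage in a single column:

# Same idea, pasting the dominant value and its rounded share into one string
df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane),
               ~ paste0(names(which(table(.) == max(table(.)))[1]),
                        " (",
                        round((max(table(.))[1] / n()) * 100),
                        "%)"))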

#  Cluster CellType Condition  Genotype    Lane   
#1       1 A (75%)  Mut (100%) XXXX (50%)  1 (75%)
#2       2 B (67%)  Mut (100%) YYYY (100%) 1 (33%)
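
For readers on dplyr 1.0.0 or later, the same summary can be sketched with across(), which supersedes the scoped summarise_at() verb. This is only an illustration: the helper name dominant_label is hypothetical (it is not part of dplyr), ties are assumed to go to the first value in sorted order, and NA values are assumed to be left out of the counts while still counting towards the denominator, as in the snippets above. The data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Same sample data as in this answer, rebuilt with data.frame()
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value of x plus its share of the group
dominant_label <- function(x) {
  counts <- table(x)         # frequency of each distinct value, sorted by value
  top <- which.max(counts)   # index of the first maximum, so ties go to the first sorted value
  paste0(names(counts)[top], " (", round(100 * counts[[top]] / length(x)), "%)")
}

res <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, dominant_label))

print(res)
```

Keeping the logic in a named function avoids computing table() several times per column and makes the tie-breaking rule easy to change in one place.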

Sample data:

df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4", 
"Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L, 
2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"), 
    Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
    ), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", 
    "XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample", 
"Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 2

Sotos

Reputation: 51582

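Here is a base R solution:

# For each cluster, take the most frequent value of columns 3:6 (CellType..Lane)
# and append its share of the cluster's rows as a rounded percentage.
m1 <- do.call(rbind, 
        lapply(split(df, df$Cluster), 
               function(i) sapply(i[3:6], 
                                  function(j) {
                                    t1 <- prop.table(table(j))
                                    nms <- names(t1[which.max(t1)])
                                    paste0(nms, ' (', round(max(t1) * 100), '%)')
                                  })))

# split() orders the groups by sorted Cluster value, so sort the labels the same way
cbind.data.frame(Cluster = sort(unique(df$Cluster)), m1)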

which gives,

  Cluster CellType  Condition    Genotype    Lane
1       1  A (75%) Mut (100%)  XXXX (50%) 1 (75%)
2       2  B (67%) Mut (100%) YYYY (100%) 1 (33%)
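
Since the question started from aggregate() with a Mode() helper, here is a minimal sketch of how that attempt could be extended instead. The helper name mode_pct is hypothetical, ties are assumed to go to the first value in sorted order, and NA values are assumed to be dropped before the shares are computed. It uses the data-frame method of aggregate() rather than the formula interface, so the Sample column is simply left out of the selection.

```r
# Same sample data as in the question, rebuilt with data.frame()
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: like Mode(), but also reports the share of the dominant value
mode_pct <- function(x) {
  share <- prop.table(table(x))   # relative frequency of each distinct value
  top <- which.max(share)         # first maximum, so ties go to the first sorted value
  paste0(names(share)[top], " (", round(100 * share[[top]]), "%)")
}

# Summarise the chosen columns per Cluster; `by` must be a list, and a one-column data frame is one
res <- aggregate(df[c("CellType", "Condition", "Genotype", "Lane")],
                 by = df["Cluster"], FUN = mode_pct)

print(res)
```

The result is an ordinary data frame with one row per cluster, so it can be written out or merged like any other table.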

Upvotes: 3
