Reputation: 51
I have the following table in R:
Sample Cluster CellType Condition Genotype Lane
Sample1 1 A Mut XXXX 1
Sample2 2 B Mut YYYY 1
Sample3 2 A Mut YYYY 2
Sample4 1 A Mut ZZZZ 1
Sample5 2 B Mut YYYY 3
Sample6 1 B Mut YYYY 1
Sample7 1 A Mut XXXX 2
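I would like to group the rows by Cluster and, for each of the other columns, report the most frequent value together with the percentage of that cluster's rows that carry it.
Like so: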
Cluster CellType Condition Genotype Lane
1 A (75%) Mut (100%) XXXX (50%) 1 (75%)
2 B (66%) Mut (100%) YYYY (100%) 1 (33%)
I've tried the aggregate function with a custom Mode function, as shown below. The result is close, but it lacks the percentages:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)
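To show what is still missing, here is a minimal sketch on a hypothetical vector `x` that holds the CellType values of the four samples in cluster 1 from the table above. `Mode` returns only the most frequent value, so the share has to be computed separately and pasted on:

```r
# Mode as defined above: most frequent value, ties go to the first one seen
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical example vector: CellType for Sample1, Sample4, Sample6, Sample7
x <- c("A", "A", "B", "A")

top   <- Mode(x)                 # most frequent value
share <- 100 * mean(x == top)    # percentage of entries equal to that value
label <- paste0(top, " (", round(share), "%)")
print(label)
```

The remaining step is to apply this per cluster and per column.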
Upvotes: 3
Views: 159
Reputation: 11955
library(dplyr)
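# funs() is deprecated since dplyr 0.8.0; a named list of ~ lambdas yields the same column names
df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane),
               list(val  = ~ names(which(table(.) == max(table(.)))[1]),
                    rate = ~ (max(table(.))[1] / n()) * 100))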
Output is:
Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
1 1 A Mut XXXX 1 75.0 100 50.0 75.0
2 2 B Mut YYYY 1 66.7 100 100 33.3
Or, to combine the value and its percentage in a single cell as in the desired output:
df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane),
               ~ paste0(names(which(table(.) == max(table(.)))[1]),
                        " (",
                        round((max(table(.))[1] / n()) * 100),
                        "%)"))
# Cluster CellType Condition Genotype Lane
#1 1 A (75%) Mut (100%) XXXX (50%) 1 (75%)
#2 2 B (67%) Mut (100%) YYYY (100%) 1 (33%)
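On current dplyr releases the same summary can also be written with `across()`, which supersedes the `summarise_at()` family. The following is a sketch, not a drop-in replacement: `top_share` is a hypothetical helper name, and the data frame is re-created inline from the question's table so that the block runs on its own.

```r
library(dplyr)

# Sample data re-created from the question's table
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value plus its share, e.g. "A (75%)"
top_share <- function(x) {
  tab <- table(x)
  top <- which.max(tab)  # on ties, the first entry in table()'s sorted order
  paste0(names(tab)[top], " (", round(100 * tab[[top]] / length(x)), "%)")
}

res <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, top_share))
print(res)
```

Moving the logic into a named helper avoids calling `table()` several times per column. Note that `length(x)` counts missing values while `table()` drops them, which matches the `n()` denominator used above.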
Sample data:
df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4",
"Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L,
2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"),
Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY",
"XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample",
"Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA,
-7L))
Upvotes: 2
Reputation: 51582
Here is a base R solution,
m1 <- do.call(rbind,
              lapply(split(df, df$Cluster),
                     function(i) sapply(i[3:6],
                                        function(j) {
                                          t1 <- prop.table(table(j))
                                          nms <- names(t1[which.max(t1)])
                                          paste0(nms, ' (', round(max(t1) * 100), '%', ')')
                                        })))
cbind.data.frame(unique(df[2]), m1)
which gives,
  Cluster CellType  Condition    Genotype    Lane
1       1  A (75%) Mut (100%)  XXXX (50%) 1 (75%)
2       2  B (67%) Mut (100%) YYYY (100%) 1 (33%)
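A shorter base R variant, offered as a sketch under the same assumptions: `aggregate()` with a `by` list keeps the Cluster key next to the summarised columns, so the labels do not have to be reattached afterwards. That matters because `unique(df[2])` returns clusters in order of first appearance, whereas `split()` orders them by sorted value. The two happen to agree for this data. `top_share` is a hypothetical helper name, and the data frame is rebuilt inline from the question's table.

```r
# Sample data re-created from the question's table
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value and its rounded share of the group
top_share <- function(x) {
  p <- prop.table(table(x))
  paste0(names(p)[which.max(p)], " (", round(100 * max(p)), "%)")
}

# Group by Cluster; the key column is carried into the result automatically
res <- aggregate(df[c("CellType", "Condition", "Genotype", "Lane")],
                 by = df["Cluster"], FUN = top_share)
print(res)
```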
Upvotes: 3