Reputation: 6578
I have a data frame:
  sample    gene
1     A1    Rim2
2     A1 CG18208
3     A1     Scr
4     A1     Scr   # gene 'Scr' occurs twice in same sample
5     A2  CG6959
6     A2  CG6959   # gene 'CG6959' occurs twice in same sample
n<-structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1",
"A2"), class = "factor"), gene = structure(c(4L, 1L, 3L,
3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample",
"gene"), row.names = c(NA, 6L), class = "data.frame")
I want to count the number of samples in which each gene is present, so a gene that occurs several times in the same sample should only be counted once for that sample.
I am currently using table to count the number of times each gene occurs:
hit_genes<-table(n$gene)
CG18208  CG6959     Scr    Rim2
      1       2       2       1
But this gives me the total count for each gene, whereas I am trying to get the count across samples. For this toy example, the result I'm trying to achieve is:
CG18208  CG6959     Scr    Rim2
      1       1       1       1
I've been trying with a combination of table and unique:
table(n$gene[unique(n$sample),])
But I can't get it to work. Can anyone suggest a way to achieve this?
Upvotes: 1
Views: 44
Reputation: 3051
You can try this:
library(dplyr)
library(tidyr)
n <- structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", "A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", "gene"), row.names = c(NA, 6L), class = "data.frame")
# make CG6959 appear also in A1 for the sake of illustration
n$sample[5] <- "A1"
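n %>%
  group_by(sample, gene) %>%
  summarize(gene2 = n()) %>%   # occurrences of each gene within each sample
  spread(sample, gene2) %>%    # one column per sample, NA where the gene is absent
  mutate(Across = ifelse(is.na(A1) | is.na(A2), 0, 1)) %>%
  filter(Across > 0)           # keep genes present in both A1 and A2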
Output:
# A tibble: 1 x 4
  gene      A1    A2 Across
  <fctr> <int> <int>  <dbl>
1 CG6959     1     1      1
So if you have many genes, this code lets you quickly narrow the result down to the genes that appear in both samples.
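If there are more than two samples, the same idea can be sketched in base R without naming each sample column. This is only an illustration, not the dplyr approach above: it uses the modified toy data (CG6959 moved into A1 as well), and the names samples_per_gene and in_all_samples are made up for the example.

```r
# Toy data from the question, with CG6959 also placed in A1 (as in the answer above)
n <- data.frame(
  sample = factor(c("A1", "A1", "A1", "A1", "A1", "A2")),
  gene   = factor(c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
                  levels = c("CG18208", "CG6959", "Scr", "Rim2"))
)

# Keep one row per distinct (sample, gene) pair, then count rows per gene.
# Each remaining row stands for one sample in which the gene occurs.
samples_per_gene <- table(unique(n)$gene)

# Genes whose count equals the number of distinct samples occur in every sample
n_samples <- length(unique(n$sample))
in_all_samples <- names(samples_per_gene)[samples_per_gene == n_samples]

print(in_all_samples)
```

With this data only CG6959 occurs in both A1 and A2, which matches the single row kept by the pipeline above.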
Upvotes: 0
Reputation: 51592
You can try dropping the duplicated sample/gene rows before tabulating:
table(n[!duplicated(n),]$gene)
#CG18208  CG6959     Scr    Rim2
#      1       1       1       1
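This works because duplicated(n) flags every row that repeats an earlier sample/gene combination, so after removing those rows each gene contributes one row per sample it occurs in. A possible alternative, shown here only as a sketch, is to build a sample-by-gene presence matrix and sum its columns; the names presence, samples_per_gene and dedup_counts are made up for the example.

```r
# Toy data from the question
n <- data.frame(
  sample = factor(c("A1", "A1", "A1", "A1", "A2", "A2")),
  gene   = factor(c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
                  levels = c("CG18208", "CG6959", "Scr", "Rim2"))
)

# Cross-tabulate samples against genes; TRUE where a gene occurs at least once
presence <- table(n$sample, n$gene) > 0

# Column sums count the samples in which each gene is present
samples_per_gene <- colSums(presence)
print(samples_per_gene)

# The de-duplication approach gives the same counts
dedup_counts <- table(n[!duplicated(n), ]$gene)
print(dedup_counts)
```

The presence matrix is convenient if you also want to see which samples contain a given gene, while the duplicated() version is shorter when only the counts are needed.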
Upvotes: 2