fugu

Reputation: 6578

Count number of occurrences of column A where column B is unique

I have a data frame:

   sample    gene
1 A1     Rim2
2 A1     CG18208
3 A1     Scr 
4 A1     Scr    # gene 'Scr' occurs twice in same sample 
5 A2     CG6959
6 A2     CG6959 # gene 'CG6959' occurs twice in same sample

n<-structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", 
"A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 
3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", 
"gene"), row.names = c(NA, 6L), class = "data.frame")

I want to count the number of samples in which each gene is present.

I am currently using table to count the number of times each gene occurs:

hit_genes<-table(n$gene)

CG18208  CG6959       Scr    Rim2 
      1       2       2       1

But this gives me the total count for each gene, whereas I want each gene counted only once per sample. For this toy example, the result I'm trying to achieve is:

CG18208  CG6959       Scr    Rim2 
      1       1       1       1

I've been trying with a combination of table and unique:

table(n$gene[unique(n$sample),])

But I can't get it to work. Can anyone suggest a way to achieve this?

Upvotes: 1

Views: 44

Answers (2)

Samuel

Reputation: 3051

You can try this:

library(dplyr)
library(tidyr)

n <- structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A1", "A2"), class = "factor"), gene = structure(c(4L, 1L, 3L, 3L, 2L, 2L), .Label = c("CG18208", "CG6959", "Scr", "Rim2"), class = "factor")), .Names = c("sample", "gene"), row.names = c(NA, 6L), class = "data.frame")

# make CG6959 appear also in A1 for the sake of illustration
n$sample[5] <- "A1"

n %>% 
  group_by(sample, gene) %>%
  summarize(gene2 = n()) %>%
  spread(sample, gene2) %>%
  mutate(Across = ifelse(is.na(A1) | is.na(A2), 0, 1)) %>%
  filter(Across > 0)

Output:

# A tibble: 1 x 4
    gene    A1    A2 Across
  <fctr> <int> <int>  <dbl>
1 CG6959     1     1      1

So if you have many genes, this code lets you quickly narrow the table down to the genes that appear in both samples.
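The pipeline above names the samples A1 and A2 explicitly. As a rough sketch of the same idea without hard-coded sample names, one could first reduce the data to distinct (sample, gene) pairs and then count rows per gene. The column name `n_samples` and the helper variable `n_total` are my own choices for illustration, and the data frame is the original toy data rebuilt with plain character columns:

```r
library(dplyr)

# Toy data from the question (unmodified, so no gene is shared between samples)
n <- data.frame(
  sample = c("A1", "A1", "A1", "A1", "A2", "A2"),
  gene   = c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959")
)

samples_per_gene <- n %>%
  distinct(sample, gene) %>%        # keep one row per (sample, gene) pair
  count(gene, name = "n_samples")   # number of distinct samples per gene

samples_per_gene

# Genes present in every sample, without naming the samples explicitly
n_total <- n_distinct(n$sample)
filter(samples_per_gene, n_samples == n_total)
```

With the unmodified toy data every gene has `n_samples` equal to 1, so the final filter returns no rows. It would only return genes once some gene occurs in all samples.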

Upvotes: 0

Sotos

Reputation: 51592

You can try dropping the duplicated rows first, so that each (sample, gene) pair is counted only once:

table(n[!duplicated(n),]$gene)

#CG18208  CG6959     Scr    Rim2 
#      1       1       1       1 
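One caveat, shown as a sketch: `duplicated()` compares whole rows, so the one-liner above relies on the data frame containing only the `sample` and `gene` columns. If the real data has further columns, it is probably safer to de-duplicate on just those two. The `extra` column below is hypothetical and exists only to illustrate the point:

```r
# Toy data from the question plus a hypothetical extra column
n <- data.frame(
  sample = c("A1", "A1", "A1", "A1", "A2", "A2"),
  gene   = factor(c("Rim2", "CG18208", "Scr", "Scr", "CG6959", "CG6959"),
                  levels = c("CG18208", "CG6959", "Scr", "Rim2")),
  extra  = 1:6   # hypothetical: makes every full row unique
)

# Whole-row de-duplication removes nothing here, so repeats are still counted
whole_row_counts <- table(n[!duplicated(n), ]$gene)

# Keep one row per (sample, gene) pair, then count samples per gene
pairs <- unique(n[, c("sample", "gene")])
samples_per_gene <- table(pairs$gene)
samples_per_gene
```

Here `samples_per_gene` is 1 for every gene, while `whole_row_counts` still reports 2 for Scr and CG6959.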

Upvotes: 2
