Reputation: 6874
I have a data frame called dat:
chr chrStart chrEnd Gene RChr RStart REnd Rname distance
chr1 39841 39883 Gene1 chr1 398 3984 Cha1b 0
chr1 39841 39883 Gene1 chr1 398 3985 Ab 0
chr1 39841 39883 Gene1 chr1 398 3986 Tia 0
chr1 39841 39883 Gene1 chr1 398 3987 MEA 0
chr1 39841 39883 Gene1 chr1 398 3988 La 0
chr1 39841 39883 Gene1 chr1 398 3989 M3 0
chr1 14893 15893 Gene2 chr1 398 3984 Cha1b 0
chr1 14893 15893 Gene2 chr1 398 3985 Cha1b 0
chr1 14893 15893 Gene2 chr1 398 3986 Cha1b 0
chr1 14893 15893 Gene2 chr1 398 3987 MEA 0
chr1 14893 15893 Gene2 chr1 398 3988 MEA 0
chr1 39841 39883 Gene1 chr1 398 3989 M3 0
I want to count how often each Rname appears for each gene, so the result for the data above should look like this:
Gene Rname Freq
Gene1 Cha1b 1
Gene1 Ab 1
Gene1 Tia 1
Gene1 MEA 1
Gene1 La 1
Gene1 M3 2
Gene2 Cha1b 3
Gene2 MEA 2
I tried grouping twice with dplyr, but I don't think that makes sense, and in any case it only gives me the frequency of each Rname overall rather than per gene:
library(dplyr)
GroupTbb <- dat %>%
  group_by(Gene) %>%
  group_by(Rname) %>%
  summarise(freq = sum(Rname))
Upvotes: 3
Views: 85
Reputation: 31161
You can try data.table:
library(data.table)
setDT(dat)[,list(count=.N), list(Gene, Rname)]
# Gene Rname count
#1: Gene1 Cha1b 1
#2: Gene1 Ab 1
#3: Gene1 Tia 1
#4: Gene1 M3 2
#5: Gene2 Cha1b 3
#6: Gene2 MEA 2
#7: Gene1 MEA 1
#8: Gene1 La 1
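For anyone who wants to reproduce this, here is a minimal sketch of the same idea. It assumes only the Gene and Rname columns matter, so dat is rebuilt from those two columns of the question's data. `.N` is data.table's built-in row count for the current group, and `.()` is its alias for `list()`. Using `as.data.table()` instead of `setDT()` is my choice here so that dat itself is not modified by reference.

```r
library(data.table)

# Only the two columns needed for counting, copied from the question's data.
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# as.data.table() works on a copy, so `dat` stays a plain data.frame;
# setDT(dat) would convert it in place instead.
# .N is the number of rows in each (Gene, Rname) group.
counts <- as.data.table(dat)[, .(count = .N), by = .(Gene, Rname)]
print(counts)
```

Groups come back in order of first appearance, which is why Gene1/M3 shows up before the Gene2 rows.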
Upvotes: 3
Reputation: 92282
You should use n() to count the observations (you can't sum non-numeric values), and you can group by both variables at once.
dat %>%
  group_by(Gene, Rname) %>%
  summarise(freq = n())
# Source: local data frame [8 x 3]
# Groups: Gene
#
# Gene Rname freq
# 1 Gene1 Ab 1
# 2 Gene1 Cha1b 1
# 3 Gene1 La 1
# 4 Gene1 M3 2
# 5 Gene1 MEA 1
# 6 Gene1 Tia 1
# 7 Gene2 Cha1b 3
# 8 Gene2 MEA 2
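As a side note on why the two separate group_by() calls in the question did not work: by default a second group_by() replaces the existing grouping rather than adding to it. The sketch below shows this on a cut-down dat (only Gene and Rname, rebuilt from the question's data, which is my assumption about what matters). The `.groups = "drop"` argument assumes dplyr 1.0.0 or later; it just returns an ungrouped result and can be left out on older versions.

```r
library(dplyr)

# Only the two columns needed for counting, copied from the question's data.
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# Two chained group_by() calls: the second one overrides the first,
# so the data ends up grouped by Rname only.
two_calls <- dat %>% group_by(Gene) %>% group_by(Rname)
overridden_vars <- group_vars(two_calls)
print(overridden_vars)

# Grouping by both columns in a single call gives one row per Gene/Rname pair.
freq <- dat %>%
  group_by(Gene, Rname) %>%
  summarise(freq = n(), .groups = "drop")
print(freq)
```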
Or use tally, as in:
dat %>%
  group_by(Gene, Rname) %>%
  tally()
Or (as suggested by @hrbrmstr) you can skip the grouping step by using count:
dat %>%
  count(Gene, Rname)
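If you want the count column to be called Freq, as in the question's desired output, count() takes a `name` argument in reasonably recent dplyr versions (I am assuming one that has it; otherwise rename the default n column afterwards). A small sketch, again on a dat rebuilt from just the Gene and Rname columns of the question:

```r
library(dplyr)

# Only the two columns needed for counting, copied from the question's data.
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# count() groups by Gene and Rname, counts rows, and ungroups in one step;
# `name` sets the name of the resulting count column.
out <- dat %>%
  count(Gene, Rname, name = "Freq")
print(out)
```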
Upvotes: 3