Reputation: 23
I am trying to group data by ranges of the numbers in a column. I have tried different versions of group_by, cut, group, etc., but I have not been able to get it to work. I have a lot of data that looks like this:
position variants
3 snv
5 snv
12 snv
17 mnv
22 deletion
27 snv
33 snv
35 snv
42 snv
46 mnv
50 snv
53 deletion
60 snv
62 snv
65 snv
70 snv
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4), "mnv", "snv", "deletion", rep("snv", 4))
variants
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
position
patient1 <- data.frame(position, variants)
patient1
I would like to be able to group the data something like this:
group tally
1-10 2snv
11-20 1snv 1mnv
21-30 1deletion 1snv
31-40 2snv
etc
so that I can run further downstream analysis. I would also like to be able to change the grouping to ranges of 1-5 or 1-2, etc. Thank you very much.
Upvotes: 2
Views: 1319
Reputation: 887118
We can use the tidyverse to do a group-by operation: create groups of ranges with cut, count the frequency of each combination of range and 'variants' with summarise, then paste the counts and variant names together in a second summarise.
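library(dplyr)
patient1 %>%
  # breaks at 0, 10, 20, ... give the ranges (0,10], (10,20], ..., i.e. 1-10, 11-20, ...
  group_by(group = cut(position, breaks = seq(0, 100, by = 10)), variants) %>%
  # count per range and variant; the result is still grouped by 'group'
  summarise(n = n()) %>%
  summarise(tally = paste(n, variants, collapse = ' ', sep = ""))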
NOTE: Another option is findInterval, which works similarly to cut but returns a numeric interval index instead of labels.
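To illustrate the difference between cut and findInterval mentioned in the note, here is a minimal sketch. The breaks and the names cut_groups and fi_groups are assumptions made for this illustration, not part of the answer's code.

```r
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
breaks <- seq(0, 100, by = 10)  # assumed to cover every position

# cut() returns a factor whose levels are interval labels such as "(0,10]"
cut_groups <- cut(position, breaks = breaks)

# findInterval() returns the integer index of the interval instead;
# left.open = TRUE makes the intervals (0,10], (10,20], ... like cut()'s default
fi_groups <- findInterval(position, breaks, left.open = TRUE)

head(data.frame(position, cut_groups, fi_groups))
```

Without left.open = TRUE, findInterval treats the intervals as [0,10), [10,20), ..., so a position that sits exactly on a break (such as 50) lands one group later than it does with cut.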
Upvotes: 1
Reputation: 388982
In base R, you can create a group column using findInterval, making groups of every 10 positions. We can then use aggregate to combine the count of each variant with its name, creating one string for each group.
patient1$group <- with(patient1, findInterval(position, seq(0, max(position), 10)))
aggregate(variants ~ group, patient1, function(x) {
  tb <- table(x)
  paste(tb, names(tb), collapse = ' ')
})
# group variants
#1 1 2 snv
#2 2 1 mnv 1 snv
#3 3 1 deletion 1 snv
#4 4 2 snv
#5 5 1 mnv 1 snv
#6 6 1 deletion 1 snv
#7 7 3 snv
#8 8 1 snv
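The groups in the output above are numeric indices, and because findInterval closes its intervals on the left by default, they correspond to the ranges 0-9, 10-19, and so on (which is why position 50 falls in group 6). As a rough sketch of one way to get the "1-10" style ranges from the question, assuming a hypothetical width variable and label column that are not part of the answer:

```r
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4),
              "mnv", "snv", "deletion", rep("snv", 4))
patient1 <- data.frame(position, variants, stringsAsFactors = FALSE)

width <- 10  # assumed bin width; change to 5 or 2 for narrower ranges
breaks <- seq(0, max(patient1$position), width)

# left.open = TRUE gives (0,10], (10,20], ..., so 50 lands in group 5
patient1$group <- findInterval(patient1$position, breaks, left.open = TRUE)

# turn the numeric index into a "1-10" style label, kept in numeric order
label <- paste0((patient1$group - 1) * width + 1, "-", patient1$group * width)
patient1$label <- factor(label, levels = unique(label[order(patient1$group)]))

res <- aggregate(variants ~ label, patient1, function(x) {
  tb <- table(x)
  paste0(tb, names(tb), collapse = " ")
})
res
```

Storing the label as a factor keeps the rows in range order; plain character labels would be sorted alphabetically by aggregate.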
Upvotes: 0
Reputation: 5722
Here is a base R solution. Of course, you can replace the intermediate variables with their corresponding calls:
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4), "mnv", "snv", "deletion", rep("snv", 4))
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
patient1 <- data.frame(position, variants)
# label each position with its range: (0,10], (10,20], ...
labels <- cut(position, seq(0, max(position), 10))
# split the data frame into one piece per range
groups <- split(patient1, labels)
# for each range, paste the count of each variant next to its name
lapply(groups, function(x) {
  tb <- table(x$variants)
  paste(tb, names(tb), collapse = ", ")
})
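Since the question also asks about changing the range size, the same cut/split idea could be wrapped in a function. This is only a sketch: tally_by_bin and its width argument are hypothetical names introduced here. It also rounds the last break up to the next multiple of the width, because with seq(0, max(position), 10) any position above the last multiple of 10 would get an NA label.

```r
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4),
              "mnv", "snv", "deletion", rep("snv", 4))
patient1 <- data.frame(position, variants, stringsAsFactors = FALSE)

# Hypothetical helper: tally variants per range of `width` positions
tally_by_bin <- function(df, width = 10) {
  # round the upper break up so that max(position) is always covered
  upper <- ceiling(max(df$position) / width) * width
  bins <- cut(df$position, breaks = seq(0, upper, by = width))
  # split the variant names by range, dropping ranges with no positions
  groups <- split(as.character(df$variants), bins, drop = TRUE)
  sapply(groups, function(v) {
    tb <- table(v)
    paste(tb, names(tb), collapse = ", ")
  })
}

res10 <- tally_by_bin(patient1, width = 10)
res5 <- tally_by_bin(patient1, width = 5)
res10
res5
```

The result is a named character vector with one entry per non-empty range, so the range size can be changed by passing a different width.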
Upvotes: 2