Reputation: 65
I have the following data frame in R:
gene_name gene_number
ENSMUSG00000000001 4732
ENSMUSG00000000001 4733
ENSMUSG00000000058 7603
ENSMUSG00000000058 7604
ENSMUSG00000000058 8246
ENSMUSG00000000058 8248
ENSMUSG00000000058 9001
The data is grouped by the gene_name column, and gene_number is sorted by other parameters (not relevant for the question). Within each gene_name group, I want to split the rows into sub-groups based on gene_number: consecutive rows should stay in the same sub-group only while the difference between one row and the next is at most 2, and a larger gap should start a new sub-group. If a value is left on its own, with no neighbour within that distance, I would like to remove it.
I want to have a new column specifying the new groups.
For example, in the data above:
ENSMUSG00000000001 4732 1
ENSMUSG00000000001 4733 1
ENSMUSG00000000058 7603 2
ENSMUSG00000000058 7604 2
ENSMUSG00000000058 8246 3
ENSMUSG00000000058 8248 3
Thank you!
Upvotes: 2
Views: 316
Reputation: 887971
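Using data.table:
library(data.table)
# flag the start of each run: the first row per gene, or a gap larger than 2
setDT(df)[, grp := c(TRUE, diff(gene_number) > 2), by = gene_name]
# the running count of the flags gives the group id
df[, grp := cumsum(grp)]
# keep only groups with more than one row
df[, .SD[.N > 1], by = grp]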
grp gene_name gene_number
1: 1 ENSMUSG00000000001 4732
2: 1 ENSMUSG00000000001 4733
3: 2 ENSMUSG00000000058 7603
4: 2 ENSMUSG00000000058 7604
5: 3 ENSMUSG00000000058 8246
6: 3 ENSMUSG00000000058 8248
df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001",
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058",
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L,
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)),
class = "data.frame", row.names = c(NA, -7L))
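For readers who want to avoid extra packages, the same idea can be sketched in base R. This is only an illustration under stated assumptions: rows belonging to the same gene_name are assumed to be adjacent, as in the question, and the names `new_grp` and `res` are made up for this example. The last step renumbers the surviving groups, which matters only when a dropped single-row group would otherwise leave a gap in the ids.

```r
# Example data from the question.
df <- data.frame(
  gene_name = c(rep("ENSMUSG00000000001", 2), rep("ENSMUSG00000000058", 5)),
  gene_number = c(4732L, 4733L, 7603L, 7604L, 8246L, 8248L, 9001L)
)

# 1 on the first row of each gene and wherever the gap to the previous row
# exceeds 2, otherwise 0 (assumes rows of one gene are adjacent).
new_grp <- ave(df$gene_number, df$gene_name,
               FUN = function(x) c(1L, diff(x) > 2L))

# The running count of the flags is the group id.
df$grp <- cumsum(new_grp)

# Keep rows whose group has more than one member.
res <- df[ave(df$grp, df$grp, FUN = length) > 1, ]

# Renumber the remaining groups as 1, 2, 3, ... so that no ids are skipped.
res$grp <- match(res$grp, unique(res$grp))

res
```

On the question's data this keeps the first six rows and drops 9001, which sits alone in its group.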
Upvotes: 1
Reputation: 389325
Here is one dplyr option:
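library(dplyr)
df %>%
  group_by(gene_name) %>%
  # TRUE where the gap to the previous row exceeds 2; default = 0 makes the
  # first row of each gene start a new group (assumes gene_number > 2)
  mutate(grp = gene_number - lag(gene_number, default = 0) > 2) %>%
  # the running count of the flags gives the group id
  group_by(grp = cumsum(grp)) %>%
  # drop groups that contain a single row
  filter(n() > 1) %>%
  ungroup()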
# gene_name gene_number grp
# <chr> <int> <int>
#1 ENSMUSG00000000001 4732 1
#2 ENSMUSG00000000001 4733 1
#3 ENSMUSG00000000058 7603 2
#4 ENSMUSG00000000058 7604 2
#5 ENSMUSG00000000058 8246 3
#6 ENSMUSG00000000058 8248 3
For each gene_name, subtract the previous gene_number value from the current one and increment the group count if the difference is greater than 2. Then drop every group that contains only a single row.
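To make the steps above concrete, here is a small sketch of the intermediate values for the second gene in the question. It uses plain base R vectors instead of the dplyr pipeline, and the names `x`, `starts`, `id` and `keep` are chosen only for this illustration.

```r
# gene_number values of the second gene in the question's example.
x <- c(7603L, 7604L, 8246L, 8248L, 9001L)

# TRUE marks a row that starts a new sub-group:
# the first row, or a gap larger than 2 to the previous row.
starts <- c(TRUE, diff(x) > 2)

# The running count of the starts is the sub-group id of each row.
id <- cumsum(starts)

# Rows whose sub-group has more than one member are kept.
keep <- ave(id, id, FUN = length) > 1

data.frame(x, starts, id, keep)
```

Here 7603 and 7604 share one id, 8246 and 8248 share the next, and 9001 gets an id of its own, so it is the only row with `keep` equal to FALSE.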
data
df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001",
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058",
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L,
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)),
class = "data.frame", row.names = c(NA, -7L))
Upvotes: 1