Reputation: 3088
I have a data.frame that looks like this:
data=data.frame(position=c(1,2,3,1,1,4,5,6,7,8,2,2),
name=c("A","B","C","A","A","D","E","F","G","H","B","B"))
position name
1 A
2 B
3 C
1 A
1 A
4 D
5 E
6 F
7 G
8 H
2 B
2 B
I would like to be able to identify in the column "position" all the consecutive intervals and then paste into a new column the length of each interval.
I would like my data to look somehow like this.
position length
1 - 3 3
4 - 8 5
Any help and comment are highly appreciated
Upvotes: 0
Views: 887
Reputation: 9087
Here is a base R solution.
Create a column, sequence
, which indicates which rows are contiguous.
data$sequence <- c(NA, head(data$position, -1)) + 1 == data$position
data$sequence[[1]] <- data$sequence[[2]]
data
#> position name sequence
#> 1 1 A TRUE
#> 2 2 B TRUE
#> 3 3 C TRUE
#> 4 1 A FALSE
#> 5 1 A FALSE
#> 6 4 D FALSE
#> 7 5 E TRUE
#> 8 6 F TRUE
#> 9 7 G TRUE
#> 10 8 H TRUE
#> 11 2 B FALSE
#> 12 2 B FALSE
Use rle
to construct the run lengths.
run_lengths <- rle(data$sequence)
i_ends <- cumsum(run_lengths$lengths)[run_lengths$values]
i_starts <- c(1, head(i_ends, -1))
data.frame(
position = paste0(data$position[i_starts], " - ", data$position[i_ends]),
length = i_ends - i_starts
)
#> position length
#> 1 1 - 3 2
#> 2 3 - 8 7
Upvotes: 1
Reputation: 11584
Does this work:
library(dplyr)
library(tidyr)
library(data.table)
data %>% mutate(ID = case_when (position == lead(position) - 1 ~ 1, TRUE ~ 0)) %>%
mutate(ID = case_when(position == lag(position) + 1 ~ 1, TRUE ~ ID)) %>% mutate(r = rleid(ID)) %>% filter(ID == 1) %>%
group_by(r) %>% mutate(position = paste(min(position),max(position), sep = '-'), length = length(unique(name))) %>% ungroup() %>% select(1,5) %>% distinct()
# A tibble: 2 x 2
position length
<chr> <int>
1 1-3 3
2 4-8 5
>
Upvotes: 1