Kepasere

Reputation: 87

R - search for numerical sequences

I have a data frame looking like this:

   nr  grp start stop l1 ratio
   11   1   300  350  +   1.0                  
   12   1   400  450  -   0.8                  
   13   1    50  550  +   1.0                  
   14   1   600  650  -   1.0                  
   21   1   800  850  -   1.0                  
   36   1  1000 1050  +   0.0       
   37   1  1100 1200  +   0.9
   38   1  1250 1300  -   0.7
   39   1  1350 1400  +   1.0

I need to find runs of consecutive numbers in the nr column and move them to a new data frame. For each run, I want to keep the whole first row and replace only its stop value with the stop value from the last row of the run.

The final df should look like this:

   nr  grp start stop l1 ratio
   11   1   300  650  +   1.0
   36   1  1000 1400  +   0.0

I tried to do it this way:

t1 <- read.table('aa.txt', sep = "\t", header = TRUE)
head(t1)
t1$chk <- NA
dl <- length(t1$nr)
for (i in 1:dl) {
  # flag row i when the next nr is exactly one higher
  if (isTRUE(t1$nr[i] + 1 == t1$nr[i + 1])) {
    t1$chk[i] <- "t"
  } else {
    t1$chk[i] <- "F"
  }
}

and I receive this:

   nr  grp start stop l1 ratio chk
   11   1   300  350  +   1.0   t             
   12   1   400  450  -   0.8   t              
   13   1    50  550  +   1.0   t              
   14   1   600  650  -   1.0   F              
   21   1   800  850  -   1.0   F              
   36   1  1000 1050  +   0.0   t   
   37   1  1100 1200  +   0.9   t
   38   1  1250 1300  -   0.7   t
   39   1  1350 1400  +   1.0   F

After that, I wanted to move all rows that have "t" in the chk column to a new df. Unfortunately, this does not work, because the last row of each sequence is flagged "F" and so is not included. Does anyone have an idea how to solve this?

Upvotes: 1

Views: 69

Answers (2)

Roland

Reputation: 132706

Easy with data.table:

library(data.table)

DT <- fread("   nr  grp start stop l1 ratio
            11   1   300  350  +   1.0                  
            12   1   400  450  -   0.8                  
            13   1    50  550  +   1.0                  
            14   1   600  650  -   1.0                  
            21   1   800  850  -   1.0                  
            36   1  1000 1050  +   0.0       
            37   1  1100 1200  +   0.9
            38   1  1250 1300  -   0.7
            39   1  1350 1400  +   1.0")

setDT(DT) #if you haven't imported with fread

#create the group ID in a separate step, for didactic reasons
DT[, groups := cumsum(c(TRUE, diff(nr) != 1))]

#take first row and replace stop from last row
DT[, if (.N > 1) {
  res <- .SD[1]
  res$stop <- .SD[.N, stop]
  res
  } else NULL, by = groups]
#   groups nr grp start stop l1 ratio
#1:      1 11   1   300  650  +     1
#2:      3 36   1  1000 1400  +     0
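
To make the group ID step easier to follow, here is a minimal base R sketch that applies the same cumsum(diff()) idea to the nr values from the question alone. The variable names breaks and groups are only illustrative.

```r
# nr values copied from the question
nr <- c(11, 12, 13, 14, 21, 36, 37, 38, 39)

# diff(nr) is 1 inside a run of consecutive numbers,
# so != 1 marks the first element of each new run
breaks <- c(TRUE, diff(nr) != 1)

# cumsum() goes up by one at every break, giving one ID per run
groups <- cumsum(breaks)
groups
# [1] 1 1 1 1 2 3 3 3 3

# number of rows per run; run 2 (nr 21) has a single row
table(groups)
```

Run 2 contains only one row, which is why the .N > 1 condition drops it from the result.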

Upvotes: 0

Ronak Shah

Reputation: 388982

We can use diff to start a new group wherever nr stops increasing by 1. Within each group we set stop to the group's last stop value, then keep only the first row of every group that has more than one row.

library(dplyr)

df %>%
   group_by(group = cumsum(c(1, diff(nr) != 1))) %>%
   mutate(stop = last(stop)) %>%
   filter(n() > 1 & row_number() == 1) %>%
   ungroup() %>%
   select(-group)

#     nr   grp start  stop l1    ratio
#  <int> <int> <int> <int> <fct> <dbl>
#1    11     1   300   650 +         1
#2    36     1  1000  1400 +         0
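
For comparison, the same grouping idea can be sketched in base R without extra packages. This is only a rough equivalent, not part of the approach above: the data frame df is rebuilt by hand from the table in the question, and the names group, pieces and result are illustrative.

```r
# data frame rebuilt from the question's table
df <- data.frame(
  nr    = c(11, 12, 13, 14, 21, 36, 37, 38, 39),
  grp   = 1,
  start = c(300, 400, 50, 600, 800, 1000, 1100, 1250, 1350),
  stop  = c(350, 450, 550, 650, 850, 1050, 1200, 1300, 1400),
  l1    = c("+", "-", "+", "-", "-", "+", "+", "-", "+"),
  ratio = c(1.0, 0.8, 1.0, 1.0, 1.0, 0.0, 0.9, 0.7, 1.0)
)

# one ID per run of consecutive nr values
group <- cumsum(c(TRUE, diff(df$nr) != 1))

# for each run with more than one row, keep the first row
# and overwrite its stop with the stop of the last row
pieces <- lapply(split(df, group), function(g) {
  if (nrow(g) < 2) return(NULL)
  first <- g[1, ]
  first$stop <- g$stop[nrow(g)]
  first
})

# drop the single-row runs and stack the rest
result <- do.call(rbind, Filter(Negate(is.null), pieces))
rownames(result) <- NULL
result
```

The result should have two rows, for nr 11 and nr 36, with stop set to 650 and 1400.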

Upvotes: 1
