Reputation: 717

Check sequences by id on a large data set R

I need to check if values for year are consecutive in a large data set.

This is how the data look:

a <- c(1,1,1, 2,2,2, 3,3,3, 4,4, 5)
b <- c(2011,2012,2010, 2009:2011, 2013,2015,2017, 2010,2010, 2011)
dat <- data.frame(cbind(a,b))
dat 

   a    b
1  1 2011
2  1 2012
3  1 2010
4  2 2009
5  2 2010
6  2 2011
7  3 2013
8  3 2015
9  3 2017
10 4 2010
11 4 2010
12 5 2011

This is the function I wrote. It works well on the small data set, but the real data set is very large (about 200k ids) and it takes a very long time to run. What can I do to make it faster?


seqyears <- function(id, year, idlist) {
  year <- as.numeric(year)
  year_values <- year[id == idlist]                  # years belonging to this id
  year_sorted <- year_values[order(year_values)]
  year_diff <- diff(year_sorted)
  answer <- unique(year_diff)

  # length 0 means that there is only one value and hence no diff can be computed
  if (length(answer) == 0) {
    return("single line")
  }
  # && short-circuits, so answer == 1 is only tested when answer has length 1
  if (length(answer) == 1 && answer == 1) {
    return("sequence ok")
  }
  return("check sequence")
}

To get a vector of values:


unlist(lapply(c(1:5), FUN=seqyears, id=dat$a, year=dat$b))

Upvotes: 4

Answers (4)

ThomasIsCoding

Reputation: 101783

A data.table option

library(data.table)
setorder(setDT(dat), a, b)[, .(x = c("check sequence", "sequence ok")[1 + all(diff(b) == 1)]), a]

gives

   a              x
1: 1    sequence ok
2: 2    sequence ok
3: 3 check sequence
4: 4 check sequence
5: 5    sequence ok
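
The function in the question also returns "single line" for an id with only one row, which the one-liner above reports as "sequence ok". If that distinction matters, a minimal sketch along the same lines could look like this. It assumes data.table is installed and rebuilds the toy data from the question; the names dt and res are only illustrative.

```r
library(data.table)

# Toy data from the question
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
b <- c(2011, 2012, 2010, 2009:2011, 2013, 2015, 2017, 2010, 2010, 2011)
dt <- data.table(a, b)

# Sort once by id and year, then label each id in a single grouped pass
setorder(dt, a, b)
res <- dt[, .(x = if (.N == 1L) {
                    "single line"        # only one row, no diff possible
                  } else if (all(diff(b) == 1)) {
                    "sequence ok"        # every step is exactly one year
                  } else {
                    "check sequence"     # gap or repeated year
                  }),
          by = a]
res
```

On the toy data this should label ids 1 and 2 as "sequence ok", ids 3 and 4 as "check sequence", and id 5 as "single line".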

Upvotes: 0

TarJae

Reputation: 78947

This might also work:

library(dplyr)
dat %>% 
  group_by(a) %>% 
  arrange(a,b) %>% 
  summarise(consecutive_sequence = ifelse(any(abs(b - lead(b)) ==1), "YES", NA))

Output:

      a consecutive_sequence
  <dbl> <chr>               
1     1 YES                 
2     2 YES                 
3     3 NA                  
4     4 NA                  
5     5 NA

Upvotes: 3

akrun

Reputation: 887251

Using dplyr

library(dplyr)
dat %>% 
    arrange(a, b) %>%
    group_by(a) %>% 
    summarise(x = case_when(any(b - lag(b) != 1, na.rm = TRUE) ~ 'check sequence', 
      TRUE ~ 'sequence ok'))

Upvotes: 2

r2evans

Reputation: 160447

I'd think you can aggregate this more simply.

aggregate(dat$b, dat[,"a",drop=FALSE], function(z) any(diff(sort(z)) != 1))
#   a     x
# 1 1 FALSE
# 2 2 FALSE
# 3 3  TRUE
# 4 4  TRUE
# 5 5 FALSE

If you need it to be that string, an ifelse does what you need:

aggregate(dat$b, dat[,"a",drop=FALSE],
          function(z) ifelse(any(diff(sort(z)) != 1), "check sequence", "sequence ok"))
#   a              x
# 1 1    sequence ok
# 2 2    sequence ok
# 3 3 check sequence
# 4 4 check sequence
# 5 5    sequence ok

If years can repeat within an id (and that is acceptable), then you can change the inner anonymous function from diff(sort(z)) to diff(sort(unique(z))).
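
As a quick sketch of that variant on the toy data (base R only; the data frame is rebuilt here so the snippet runs on its own, and res is just an illustrative name):

```r
# Toy data from the question
dat <- data.frame(
  a = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  b = c(2011, 2012, 2010, 2009:2011, 2013, 2015, 2017, 2010, 2010, 2011)
)

# TRUE means the id needs checking; repeated years are dropped before diff()
res <- aggregate(dat$b, dat[, "a", drop = FALSE],
                 function(z) any(diff(sort(unique(z))) != 1))
res
#   a     x
# 1 1 FALSE
# 2 2 FALSE
# 3 3  TRUE
# 4 4 FALSE
# 5 5 FALSE
```

Id 4 (2010, 2010) is no longer flagged, because after unique() only one year is left and there is nothing to diff.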

Upvotes: 6
