Sharif Amlani
Sharif Amlani

Reputation: 1278

How to capture the range of non-sequential numbers in r

I'm working with time-series data that looks like this:

     Missing_Years       cname
1263          1960 Afghanistan
1264          1961 Afghanistan
1265          1962 Afghanistan
1266          1963 Afghanistan
1267          1964 Afghanistan
1268          1997 Afghanistan
1269          1998 Afghanistan
1270          1999 Afghanistan
1271          2000 Afghanistan
1272          2001 Afghanistan
1273          2002 Afghanistan
1274          2003 Afghanistan
1275          2004 Afghanistan
1276          2005 Afghanistan

What I would like is a data set that notes the name of the country and the range of the years. If you notice, the dates are not sequential. So using the range function alone would be inaccurate. This is the tricky part.

I would like the output to look something like this:

   cname          Missing_Years
Afghanistan       1960 - 1964 
Afghanistan       1997 - 2005 

Please note that I'll be looping over countries. So any code that can be fashioned to do so would be much appreciated.

Thanks so much for the help!

Upvotes: 1

Views: 183

Answers (2)

akrun
akrun

Reputation: 886938

We can use data.table as well

library(data.table)
setDT(df)[, .(cname = first(cname), Missing_Years = paste(range(Missing_Years),
      collapse="_")), .(grp = cumsum(Missing_Years - shift(Missing_Years, 
         fill = first(Missing_Years)) > 1 ))][, grp := NULL][]
#         cname Missing_Years
#1: Afghanistan     1960_1964
#2: Afghanistan     1997_2005

data

df <- structure(list(Missing_Years = c(1960L, 1961L, 1962L, 1963L, 
1964L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L), cname = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor")), class = 
"data.frame", row.names = c(NA,-14L))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388797

You can find out the rows where sequence breaks using diff and use it as a grouping variable.

df$group <- cumsum(c(TRUE, diff(df$Missing_Years) > 1))
aggregate(Missing_Years~cname+group, df, function(x) paste(range(x),collapse = "-"))

Or using dplyr we can compare the values using lag.

library(dplyr)

df %>%
  group_by(cname,grp = cumsum(Missing_Years-lag(Missing_Years,default = 0) > 1)) %>%
  summarise(Missing_Years = paste(range(Missing_Years),  collapse = "-")) %>%
  select(-grp)

#  cname       Missing_Years
#  <fct>       <chr>        
#1 Afghanistan 1960-1964    
#2 Afghanistan 1997-2005    

data

df <- structure(list(Missing_Years = c(1960L, 1961L, 1962L, 1963L, 
1964L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L), cname = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor")), class = 
"data.frame", row.names = c(NA,-14L))

Upvotes: 1

Related Questions