Reputation: 1278
I'm working with time-series data that looks like this:
Missing_Years cname
1263 1960 Afghanistan
1264 1961 Afghanistan
1265 1962 Afghanistan
1266 1963 Afghanistan
1267 1964 Afghanistan
1268 1997 Afghanistan
1269 1998 Afghanistan
1270 1999 Afghanistan
1271 2000 Afghanistan
1272 2001 Afghanistan
1273 2002 Afghanistan
1274 2003 Afghanistan
1275 2004 Afghanistan
1276 2005 Afghanistan
What I would like is a data set that notes the name of the country and the range of the years. If you notice, the dates are not sequential. So using the range
function alone would be inaccurate. This is the tricky part.
I would like the output to look something like this:
cname Missing_Years
Afghanistan 1960 - 1964
Afghanistan 1997 - 2005
Please note that I'll be looping over countries. So any code that can be fashioned to do so would be much appreciated.
Thanks so much for the help!
Upvotes: 1
Views: 183
Reputation: 886938
We can use data.table
as well
library(data.table)
setDT(df)[, .(cname = first(cname), Missing_Years = paste(range(Missing_Years),
collapse="_")), .(grp = cumsum(Missing_Years - shift(Missing_Years,
fill = first(Missing_Years)) > 1 ))][, grp := NULL][]
# cname Missing_Years
#1: Afghanistan 1960_1964
#2: Afghanistan 1997_2005
df <- structure(list(Missing_Years = c(1960L, 1961L, 1962L, 1963L,
1964L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L), cname = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor")), class =
"data.frame", row.names = c(NA,-14L))
Upvotes: 1
Reputation: 388797
You can find out the rows where sequence breaks using diff
and use it as a grouping variable.
df$group <- cumsum(c(TRUE, diff(df$Missing_Years) > 1))
aggregate(Missing_Years~cname+group, df, function(x) paste(range(x),collapse = "-"))
Or using dplyr
we can compare the values using lag
.
library(dplyr)
df %>%
group_by(cname,grp = cumsum(Missing_Years-lag(Missing_Years,default = 0) > 1)) %>%
summarise(Missing_Years = paste(range(Missing_Years), collapse = "-")) %>%
select(-grp)
# cname Missing_Years
# <fct> <chr>
#1 Afghanistan 1960-1964
#2 Afghanistan 1997-2005
data
df <- structure(list(Missing_Years = c(1960L, 1961L, 1962L, 1963L,
1964L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L), cname = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor")), class =
"data.frame", row.names = c(NA,-14L))
Upvotes: 1