lg929
lg929

Reputation: 234

Identify start date, end date, length of run of consecutive number, and transpose into new data frame

I have a set of data that looks like this:

          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1

From this, I am trying to create a new data frame that will show me what dates my runs of 1's occur as well as how long these runs are, with the column headers: 1) start date, 2) end date, and 3) length of run.

Ultimately, I want to create a data frame that looks like this from the data I have above:

  Start Date   End Date  Length of Run
1 2006-06-01 2006-06-03              3
2 2006-06-06 2006-06-07              2  

I have a few NA's in my data that I need to ignore throughout my data as well.

Upvotes: 5

Views: 1552

Answers (3)

josliber
josliber

Reputation: 44299

You could do this with dplyr, using mutate to convert missing boolean values to 0, group_by to compute groups with constant values of variable boolean, filter to limit to groups where boolean was set to 1 and where the group had more than one member, and then summarize to grab the relevant summary information. (I take a few extra steps to remove the grouping variable at the end).

library(dplyr)
dat %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date"=min(as.character(Date)),
            "End Date"=max(as.character(Date)),
            "Length of Run"=n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2

Data:

dat <- read.table(text="          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1", header=T)

Upvotes: 4

oshun
oshun

Reputation: 2349

Another answer using base, reformatting this answer's use of cumsum and diff.

#Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]

x1$Date <- as.Date(x1$Date)  #Convert date from factor to Date class

#Starting at 0, if the difference between eligible dates is >1 day, 
#   add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <-  cumsum(c(0, diff(x1$Date) > 1))

#          Date boolean SeriesNo
#407 2006-06-01       1        0
#408 2006-06-02       1        0
#409 2006-06-03       1        0
#412 2006-06-06       1        1
#413 2006-06-07       1        1
#415 2006-06-09       1        2

# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
x2 <-  as.data.frame(as.list(
         aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz) 
         c(Start = min(zz), End= max(zz), Run = length(zz) ))
       )) #see note after this code block

#Output is in days since origin.  Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End   <- as.Date(x2$Date.End,   origin = "1970-01-01")

#  SeriesNo Date.Start   Date.End Date.Run
#1        0 2006-06-01 2006-06-03        3
#2        1 2006-06-06 2006-06-07        2
#3        2 2006-06-09 2006-06-09        1

A note on "buggy" output from aggregate: Using aggregate to apply several functions on several variables in one call

Upvotes: 1

Pierre L
Pierre L

Reputation: 28441

We can also use data.table to subset and cast the data as needed. First we create an id column with rleid(boolean). Next, subset the data according to the necessary conditions. Lastly, we create start, end, and run with the subsetted data:

library(data.table)
setDT(dat)[,id := rleid(boolean)][
  ,.SD[.N > 1 & boolean == 1],id][
  ,.(start=Date[1],end=Date[.N], run=.N),id]
#   id      start        end run
#1:  1 2006-06-01 2006-06-03   3
#2:  4 2006-06-06 2006-06-07   2

Upvotes: 3

Related Questions