Count number of months since start in a data frame

Question

I have a data frame df which contains data from a car selling company. The data frame contains the dates and the number of sells for specific dates. Each salesperson has a staff_id. A dummy inital_sell indicates which day is the first working day of the person.

Now I want to add a column months_since_start which adds for each day the month since the person started to work. Then I can use the sells and the months_since_start columns to plot the mean sells for each month since the salespersons started to work (sells for the 1st month for each salesperson, in the 2nd month...). As some days and months are missing (e.g., during holidays as shown in the bottom part of the example), I can't simply add a sequence to get months_since_start.

date        year    month   staff_id   sells  initial_sell   months_since_start
2014-11-11  2014    11      1          3      1              1
2014-11-12  2014    11      1          1      0              1
2014-11-14  2014    11      1          1      0              1
2014-11-15  2014    11      1          2      0              1
...                     
2014-12-10  2014    12      1          2      0              1
2014-12-11  2014    12      1          1      0              2
...                     
2014-12-23  2014    12      2          1      1              1
2015-02-02  2015    2       2          4      0              2
2015-02-03  2015    2       2          1      0              2
...                     
2015-03-23  2015    3       2          3      0              4
...

Can someone help me how to get the month_since_start column?

G. Grothendieck · Accepted Answer

The input is assumed to be sorted by staff_id and date, as in the question, and shown at the end in the Note. Define a months function which given a sorted vector of dates for a staff member returns the months since start (i.e. since the first date) for that member. Then use tapply to apply it to each staff member. tapply returns a list sorted by staff_id so unravel it with unlist. No packages are used.

Months <- function(date) {
  with(as.POSIXlt(date), 12 * (year - year[1]) + (mon - mon[1]) + (mday >= mday[1]))
}

transform(DF, months_since_start = unlist(tapply(date, staff_id, FUN = Months)))

giving:

         date year month staff_id sells initial_sell months_since_start
1  2014-11-11 2014    11        1     3            1                  1
2  2014-11-12 2014    11        1     1            0                  1
3  2014-11-14 2014    11        1     1            0                  1
4  2014-11-15 2014    11        1     2            0                  1
5  2014-12-10 2014    12        1     2            0                  1
6  2014-12-11 2014    12        1     1            0                  2
7  2014-12-23 2014    12        2     1            1                  1
8  2015-02-02 2015     2        2     4            0                  2
9  2015-02-03 2015     2        2     1            0                  2
10 2015-03-23 2015     3        2     3            0                  4

Alternative An alternative using ave instead of tapply is as follows. Months is as above. MonthsDF invokes Months but accepts row numbers rather than the date itself. This solution still assumes that the data is sorted by date within staff_id but since ave returns its output in the same order as the input it need not be sorted by staff_id. The downside of ave is that it does not handle "Date" class data in the manner needed here which is why we used row numbers as the input to MonthsDF:

MonthsDF <- function(ix) Months(DF$date[ix])
transform(DF, months_since_start = ave(seq_along(date), staff_id, FUN = MonthsDF))

Note: This input was used:

Lines <- "date        year    month   staff_id   sells  initial_sell   
2014-11-11  2014    11      1          3      1              
2014-11-12  2014    11      1          1      0              
2014-11-14  2014    11      1          1      0              
2014-11-15  2014    11      1          2      0                            
2014-12-10  2014    12      1          2      0              
2014-12-11  2014    12      1          1      0              
2014-12-23  2014    12      2          1      1              
2015-02-02  2015    2       2          4      0              
2015-02-03  2015    2       2          1      0              
2015-03-23  2015    3       2          3      0"

DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)

# in the question the input is already sorted by staff_id and date so
# the next two lines are not really needed but if we had non-sorted data
# then we should first sort it like this to be in the same form as in question
o <- with(DF, order(staff_id, date))
DF <- DF[o, ]

Count number of months since start in a data frame

Answers (2)

Related Questions