jeffrey
jeffrey

Reputation: 2096

Count number of months since start in a data frame

I have a data frame df which contains data from a car selling company. The data frame contains the dates and the number of sells for specific dates. Each salesperson has a staff_id. A dummy inital_sell indicates which day is the first working day of the person.

Now I want to add a column months_since_start which adds for each day the month since the person started to work. Then I can use the sells and the months_since_start columns to plot the mean sells for each month since the salespersons started to work (sells for the 1st month for each salesperson, in the 2nd month...). As some days and months are missing (e.g., during holidays as shown in the bottom part of the example), I can't simply add a sequence to get months_since_start.

date        year    month   staff_id   sells  initial_sell   months_since_start
2014-11-11  2014    11      1          3      1              1
2014-11-12  2014    11      1          1      0              1
2014-11-14  2014    11      1          1      0              1
2014-11-15  2014    11      1          2      0              1
...                     
2014-12-10  2014    12      1          2      0              1
2014-12-11  2014    12      1          1      0              2
...                     
2014-12-23  2014    12      2          1      1              1
2015-02-02  2015    2       2          4      0              2
2015-02-03  2015    2       2          1      0              2
...                     
2015-03-23  2015    3       2          3      0              4
...

Can someone help me how to get the month_since_start column?

Upvotes: 2

Views: 557

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269441

The input is assumed to be sorted by staff_id and date, as in the question, and shown at the end in the Note. Define a months function which given a sorted vector of dates for a staff member returns the months since start (i.e. since the first date) for that member. Then use tapply to apply it to each staff member. tapply returns a list sorted by staff_id so unravel it with unlist. No packages are used.

Months <- function(date) {
  with(as.POSIXlt(date), 12 * (year - year[1]) + (mon - mon[1]) + (mday >= mday[1]))
}

transform(DF, months_since_start = unlist(tapply(date, staff_id, FUN = Months)))

giving:

         date year month staff_id sells initial_sell months_since_start
1  2014-11-11 2014    11        1     3            1                  1
2  2014-11-12 2014    11        1     1            0                  1
3  2014-11-14 2014    11        1     1            0                  1
4  2014-11-15 2014    11        1     2            0                  1
5  2014-12-10 2014    12        1     2            0                  1
6  2014-12-11 2014    12        1     1            0                  2
7  2014-12-23 2014    12        2     1            1                  1
8  2015-02-02 2015     2        2     4            0                  2
9  2015-02-03 2015     2        2     1            0                  2
10 2015-03-23 2015     3        2     3            0                  4

Alternative An alternative using ave instead of tapply is as follows. Months is as above. MonthsDF invokes Months but accepts row numbers rather than the date itself. This solution still assumes that the data is sorted by date within staff_id but since ave returns its output in the same order as the input it need not be sorted by staff_id. The downside of ave is that it does not handle "Date" class data in the manner needed here which is why we used row numbers as the input to MonthsDF:

MonthsDF <- function(ix) Months(DF$date[ix])
transform(DF, months_since_start = ave(seq_along(date), staff_id, FUN = MonthsDF))

Note: This input was used:

Lines <- "date        year    month   staff_id   sells  initial_sell   
2014-11-11  2014    11      1          3      1              
2014-11-12  2014    11      1          1      0              
2014-11-14  2014    11      1          1      0              
2014-11-15  2014    11      1          2      0                            
2014-12-10  2014    12      1          2      0              
2014-12-11  2014    12      1          1      0              
2014-12-23  2014    12      2          1      1              
2015-02-02  2015    2       2          4      0              
2015-02-03  2015    2       2          1      0              
2015-03-23  2015    3       2          3      0"

DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)

# in the question the input is already sorted by staff_id and date so
# the next two lines are not really needed but if we had non-sorted data
# then we should first sort it like this to be in the same form as in question
o <- with(DF, order(staff_id, date))
DF <- DF[o, ]

Upvotes: 3

alistaire
alistaire

Reputation: 43334

An approach using dplyr and lubridate:

library(dplyr)
library(lubridate)
# some sample data
df <- data.frame(date = rep(seq(as.Date('2014-01-01'), as.Date('2014-04-04'), by = 30), 3),
                 staff_id = rep(1:3, each = 4))

so df looks like:

> head(df)
        date staff_id
1 2014-01-01        1
2 2014-01-31        1
3 2014-03-02        1
4 2014-04-01        1
5 2014-01-01        2
6 2014-01-31        2

Now use dplyr to group_by staff_id, and then use mutate to add a column. Within mutate, set months_since_start to the time_length of the interval from the minimum date (for each staff_id, thanks to group_by) and the date of each row. Set the unit of time_length to month.

df %>% group_by(staff_id) %>% 
  mutate(months_since_start = time_length(interval(min(date), 
                                                   date), 
                                          unit = 'month'))

You get:

Source: local data frame [12 x 3]
Groups: staff_id [3]

         date staff_id months_since_start
       (date)    (int)              (dbl)
1  2014-01-01        1          0.0000000
2  2014-01-31        1          0.9677419
3  2014-03-02        1          2.0322581
4  2014-04-01        1          3.0000000
5  2014-01-01        2          0.0000000
6  2014-01-31        2          0.9677419
7  2014-03-02        2          2.0322581
8  2014-04-01        2          3.0000000
9  2014-01-01        3          0.0000000
10 2014-01-31        3          0.9677419
11 2014-03-02        3          2.0322581
12 2014-04-01        3          3.0000000

If you'd like completed months, wrap time_length in floor.

Upvotes: 2

Related Questions