Reputation: 2096
I have a data frame df
which contains data from a car selling company. The data frame contains the dates and the number of sells for specific dates. Each salesperson has a staff_id
. A dummy inital_sell
indicates which day is the first working day of the person.
Now I want to add a column months_since_start
which adds for each day the month since the person started to work. Then I can use the sells
and the months_since_start
columns to plot the mean sells for each month since the salespersons started to work (sells for the 1st month for each salesperson, in the 2nd month...). As some days and months are missing (e.g., during holidays as shown in the bottom part of the example), I can't simply add a sequence to get months_since_start
.
date year month staff_id sells initial_sell months_since_start
2014-11-11 2014 11 1 3 1 1
2014-11-12 2014 11 1 1 0 1
2014-11-14 2014 11 1 1 0 1
2014-11-15 2014 11 1 2 0 1
...
2014-12-10 2014 12 1 2 0 1
2014-12-11 2014 12 1 1 0 2
...
2014-12-23 2014 12 2 1 1 1
2015-02-02 2015 2 2 4 0 2
2015-02-03 2015 2 2 1 0 2
...
2015-03-23 2015 3 2 3 0 4
...
Can someone help me how to get the month_since_start
column?
Upvotes: 2
Views: 557
Reputation: 269441
The input is assumed to be sorted by staff_id
and date
, as in the question, and shown at the end in the Note. Define a months
function which given a sorted vector of dates for a staff member returns the months since start (i.e. since the first date) for that member. Then use tapply
to apply it to each staff member. tapply
returns a list sorted by staff_id
so unravel it with unlist
. No packages are used.
Months <- function(date) {
with(as.POSIXlt(date), 12 * (year - year[1]) + (mon - mon[1]) + (mday >= mday[1]))
}
transform(DF, months_since_start = unlist(tapply(date, staff_id, FUN = Months)))
giving:
date year month staff_id sells initial_sell months_since_start
1 2014-11-11 2014 11 1 3 1 1
2 2014-11-12 2014 11 1 1 0 1
3 2014-11-14 2014 11 1 1 0 1
4 2014-11-15 2014 11 1 2 0 1
5 2014-12-10 2014 12 1 2 0 1
6 2014-12-11 2014 12 1 1 0 2
7 2014-12-23 2014 12 2 1 1 1
8 2015-02-02 2015 2 2 4 0 2
9 2015-02-03 2015 2 2 1 0 2
10 2015-03-23 2015 3 2 3 0 4
Alternative An alternative using ave
instead of tapply
is as follows. Months
is as above. MonthsDF
invokes Months
but accepts row numbers rather than the date itself. This solution still assumes that the data is sorted by date
within staff_id
but since ave
returns its output in the same order as the input it need not be sorted by staff_id
. The downside of ave
is that it does not handle "Date"
class data in the manner needed here which is why we used row numbers as the input to MonthsDF
:
MonthsDF <- function(ix) Months(DF$date[ix])
transform(DF, months_since_start = ave(seq_along(date), staff_id, FUN = MonthsDF))
Note: This input was used:
Lines <- "date year month staff_id sells initial_sell
2014-11-11 2014 11 1 3 1
2014-11-12 2014 11 1 1 0
2014-11-14 2014 11 1 1 0
2014-11-15 2014 11 1 2 0
2014-12-10 2014 12 1 2 0
2014-12-11 2014 12 1 1 0
2014-12-23 2014 12 2 1 1
2015-02-02 2015 2 2 4 0
2015-02-03 2015 2 2 1 0
2015-03-23 2015 3 2 3 0"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)
# in the question the input is already sorted by staff_id and date so
# the next two lines are not really needed but if we had non-sorted data
# then we should first sort it like this to be in the same form as in question
o <- with(DF, order(staff_id, date))
DF <- DF[o, ]
Upvotes: 3
Reputation: 43334
An approach using dplyr
and lubridate
:
library(dplyr)
library(lubridate)
# some sample data
df <- data.frame(date = rep(seq(as.Date('2014-01-01'), as.Date('2014-04-04'), by = 30), 3),
staff_id = rep(1:3, each = 4))
so df
looks like:
> head(df)
date staff_id
1 2014-01-01 1
2 2014-01-31 1
3 2014-03-02 1
4 2014-04-01 1
5 2014-01-01 2
6 2014-01-31 2
Now use dplyr
to group_by
staff_id
, and then use mutate
to add a column. Within mutate
, set months_since_start
to the time_length
of the interval
from the min
imum date (for each staff_id
, thanks to group_by
) and the date
of each row. Set the unit
of time_length
to month
.
df %>% group_by(staff_id) %>%
mutate(months_since_start = time_length(interval(min(date),
date),
unit = 'month'))
You get:
Source: local data frame [12 x 3]
Groups: staff_id [3]
date staff_id months_since_start
(date) (int) (dbl)
1 2014-01-01 1 0.0000000
2 2014-01-31 1 0.9677419
3 2014-03-02 1 2.0322581
4 2014-04-01 1 3.0000000
5 2014-01-01 2 0.0000000
6 2014-01-31 2 0.9677419
7 2014-03-02 2 2.0322581
8 2014-04-01 2 3.0000000
9 2014-01-01 3 0.0000000
10 2014-01-31 3 0.9677419
11 2014-03-02 3 2.0322581
12 2014-04-01 3 3.0000000
If you'd like completed months, wrap time_length
in floor
.
Upvotes: 2