Reputation: 410
I've seen a lot of questions on here about vectorising for loops, but couldn't find any that involve vectorising a for loop to populate a cell based on the value of a cell in a row below (apologies if I'm just being blind though...).
I have a dataframe with 1.6 million rows of salaries and the date each person started earning that salary. Each person can have multiple salaries, and so multiple rows, each with a different date that it was updated.
Code for a dummy dataset is as follows:
df1 <- data.frame("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(df1$id,df1$start_date),]
I want to create a column with an end date for each salary, which is calculated as the day before the subsequent salary entry. If there is no subsequent salary entry, then it's set as today's date. This is my code, including a for loop, to do that:
df1$end_date <- Sys.Date()
for (i in 1:(nrow(df1)-1)){
if(df1[i,1]== df1[i+1,1]){
df1[i,4] <- df1[i+1,3]-1
}
print(i)
}
However, I know that for
loops are not the most efficient way, but how would I go about vectorising this?
Upvotes: 0
Views: 60
Reputation: 38500
If I understand your question, the following base R code will work.
df1$end <- ave(df1$start_date, df1$id, FUN=function(x) c(tail(x, -1) - 1, Sys.Date()))
ave
is used to perform the group level operation. The function performed takes the second through final date and subtracts 1. This is concatenated with the final date.
This returns
df1
id salary start_date end
1 1 15456 2016-03-20 2016-12-06
2 1 16594 2016-12-07 2017-05-22
3 2 17364 2016-10-17 2016-07-27
4 2 34564 2016-07-28 2017-05-22
5 3 34525 2016-05-26 2016-05-01
6 3 33656 2016-05-02 2017-05-22
7 4 23464 2017-04-17 2016-01-19
8 4 23467 2016-01-20 2017-05-22
9 5 16794 2016-09-12 2016-05-06
10 5 27454 2016-05-07 2017-05-22
11 6 40663 2016-10-03 2016-03-28
12 6 42743 2016-03-29 2017-05-22
Upvotes: 0
Reputation: 3875
Using the dplyr
package, you could do:
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(end_date=lead(start_date-1,default=Sys.Date()))
Which returns:
id salary start_date end_date
<dbl> <dbl> <date> <date>
1 1 15456 2016-02-14 2016-03-02
2 1 16594 2016-03-03 2017-05-22
3 2 17364 2016-01-17 2016-11-28
4 2 34564 2016-11-29 2017-05-22
5 3 33656 2016-08-17 2016-11-25
6 3 34525 2016-11-26 2017-05-22
7 4 23464 2016-01-20 2017-05-05
8 4 23467 2017-05-06 2017-05-22
9 5 27454 2016-02-29 2016-12-15
10 5 16794 2016-12-16 2017-05-22
11 6 42743 2016-03-14 2017-01-29
12 6 40663 2017-01-30 2017-05-22
Upvotes: 2
Reputation: 17689
You can use library(data.table)
:
setDT(df1)[, end_date := shift(start_date, type = "lead", fill = Sys.Date()), id][]
Upvotes: 1
Reputation: 2365
With data.table and shift, you can use below:
df1 <- data.table("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(id,start_date),]
df1[, EndDate := shift(start_date, type="lead"), id]
df1[is.na(EndDate), EndDate := Sys.Date()]
Upvotes: 0