tktk234

Reputation: 410

Vectorising iterative operation across rows

I've seen a lot of questions on here about vectorising for loops, but couldn't find any that involve vectorising a for loop to populate a cell based on the value of a cell in a row below (apologies if I'm just being blind though...).

I have a dataframe with 1.6 million rows of salaries and the date each person started earning that salary. Each person can have multiple salaries, and so multiple rows, each with a different date that it was updated.

Code for a dummy dataset is as follows:

df1 <- data.frame("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
                  "salary" = c(15456,16594,
                               17364,34564,
                               34525,33656,
                               23464,23467,
                               16794,27454,
                               40663,42743),
                  "start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))

df1 <- df1[order(df1$id,df1$start_date),]

I want to create a column with an end date for each salary, which is calculated as the day before the subsequent salary entry. If there is no subsequent salary entry, then it's set as today's date. This is my code, including a for loop, to do that:

df1$end_date <- Sys.Date()

for (i in 1:(nrow(df1)-1)){
  if(df1[i,1]== df1[i+1,1]){
    df1[i,4] <- df1[i+1,3]-1
  }
  print(i)
} 

I know that for loops are not the most efficient approach for this many rows. How would I go about vectorising this?

Upvotes: 0

Views: 60

Answers (4)

lmo

Reputation: 38500

If I understand your question, the following base R code will work.

df1$end <- ave(df1$start_date, df1$id, FUN=function(x) c(tail(x, -1) - 1, Sys.Date()))

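ave is used to perform the group-level operation. Within each id, the function takes the second through final start dates and subtracts 1 from each, then appends today's date (Sys.Date()) as the end date of the last entry. The rows must already be sorted by id and start_date, as in the question.

This returns (the sample below was evidently generated without the sorting step, so some end dates precede their start dates)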

df1
   id salary start_date        end
1   1  15456 2016-03-20 2016-12-06
2   1  16594 2016-12-07 2017-05-22
3   2  17364 2016-10-17 2016-07-27
4   2  34564 2016-07-28 2017-05-22
5   3  34525 2016-05-26 2016-05-01
6   3  33656 2016-05-02 2017-05-22
7   4  23464 2017-04-17 2016-01-19
8   4  23467 2016-01-20 2017-05-22
9   5  16794 2016-09-12 2016-05-06
10  5  27454 2016-05-07 2017-05-22
11  6  40663 2016-10-03 2016-03-28
12  6  42743 2016-03-29 2017-05-22
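
Because the question's sample uses random dates, a fixed-date check may be easier to follow. The sketch below is only an illustration: the data frame `toy`, its dates, and the fixed `today` are made up and not taken from the question. It compares the ave() call with the loop from the question on data that is already sorted by id and start_date.

```r
# Hypothetical toy data with fixed dates (not from the question),
# already sorted by id and start_date.
toy <- data.frame(
  id = c(1, 1, 1, 2, 3, 3),
  start_date = as.Date(c("2016-01-10", "2016-06-01", "2017-02-15",
                         "2016-03-05",
                         "2016-04-20", "2016-09-30"))
)
today <- as.Date("2017-05-22")  # fixed stand-in for Sys.Date()

# Vectorised version: within each id, drop the first date, subtract a day
# from the rest, and pad the last row with `today`.
toy$end <- ave(toy$start_date, toy$id,
               FUN = function(x) c(tail(x, -1) - 1, today))

# Loop from the question (indexing by column name), for comparison.
toy$end_loop <- today
for (i in seq_len(nrow(toy) - 1)) {
  if (toy$id[i] == toy$id[i + 1]) {
    toy$end_loop[i] <- toy$start_date[i + 1] - 1
  }
}

# Stops with an error if the two approaches disagree.
stopifnot(all(toy$end == toy$end_loop))
```

The single-row group (id 2) is handled without special-casing: tail(x, -1) is empty there, so the function returns just `today`.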

Upvotes: 0

Lamia

Reputation: 3875

Using the dplyr package, you could do:

library(dplyr)
df1 %>% 
group_by(id) %>% 
mutate(end_date=lead(start_date-1,default=Sys.Date()))

Which returns:

      id salary start_date   end_date
   <dbl>  <dbl>     <date>     <date>
 1     1  15456 2016-02-14 2016-03-02
 2     1  16594 2016-03-03 2017-05-22
 3     2  17364 2016-01-17 2016-11-28
 4     2  34564 2016-11-29 2017-05-22
 5     3  33656 2016-08-17 2016-11-25
 6     3  34525 2016-11-26 2017-05-22
 7     4  23464 2016-01-20 2017-05-05
 8     4  23467 2017-05-06 2017-05-22
 9     5  27454 2016-02-29 2016-12-15
10     5  16794 2016-12-16 2017-05-22
11     6  42743 2016-03-14 2017-01-29
12     6  40663 2017-01-30 2017-05-22
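
To see what lead() is doing inside the mutate(), it may help to run it on a plain vector. This is a minimal sketch assuming dplyr is installed; the vector `x` and the fixed `today` are hypothetical values chosen for illustration.

```r
library(dplyr)

# Hypothetical start dates for a single person, in ascending order.
x <- as.Date(c("2016-01-10", "2016-06-01", "2017-02-15"))
today <- as.Date("2017-05-22")  # fixed stand-in for Sys.Date()

# lead() shifts the vector up by one position; the slot left empty at the
# end is filled with `default` instead of NA.
end_date <- lead(x - 1, default = today)
print(end_date)
```

Each element should be the day before the next start date, with `today` in the last position. Inside group_by(id), the same shift is applied separately to each person's rows.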

Upvotes: 2

Tonio Liebrand

Reputation: 17689

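You can use the data.table package. Shifting start_date - 1 (rather than start_date itself) gives the day before the next entry, as the question asks:

library(data.table)
setDT(df1)[, end_date := shift(start_date - 1, type = "lead", fill = Sys.Date()), id][]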
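
A fixed-date sketch of the grouped shift(), subtracting one day so that each end date falls the day before the next start date. It assumes data.table is installed; the table `toy`, its dates, and the fixed `today` are hypothetical. The rows are deliberately out of order to show why sorting comes first.

```r
library(data.table)

# Hypothetical toy data (not from the question), deliberately unsorted.
toy <- data.table(
  id = c(3, 1, 2, 1, 3),
  start_date = as.Date(c("2016-09-30", "2016-06-01", "2016-03-05",
                         "2016-01-10", "2016-04-20"))
)
today <- as.Date("2017-05-22")  # fixed stand-in for Sys.Date()

# shift() works on row order, so sort by id and start_date first.
setorder(toy, id, start_date)

# Within each id, take the next row's start date minus one day;
# the last row of each id has no successor and gets `today`.
toy[, end_date := shift(start_date - 1, type = "lead", fill = today), by = id]
print(toy)
```

setorder() sorts by reference, which avoids copying a large table just to reorder it.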

Upvotes: 1

Sabri Karagönen

Reputation: 2365

With data.table and shift, you can use the following:

library(data.table)

df1 <- data.table("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
                  "salary" = c(15456,16594,
                               17364,34564,
                               34525,33656,
                               23464,23467,
                               16794,27454,
                               40663,42743),
                  "start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))

df1 <- df1[order(id,start_date),]

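# Day before the next start date within each id (NA on each id's last row)
df1[, EndDate := shift(start_date, type="lead") - 1, id]
# Salaries with no later entry run until today
df1[is.na(EndDate), EndDate := Sys.Date()]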

Upvotes: 0
