Compute date difference between rows considering different groups inside dataframe

Question

I am working with some information from treatments applied to animals in R. First I would like to describe the structure of my information (I will add the dput() version in the end). My data is DF and it looks like this:

  Treatment_ID Start_Date      Valid
1         0031 2011-05-01 2011-05-30
2         0031 2011-05-01 2011-06-30
3         0045 2012-02-01 2012-03-01
4         0057 2012-04-01 2012-04-30
5         0057 2012-04-01 2012-05-30
6         0098 2012-10-01 2012-10-30

It has 56 rows and three variables Treatment_ID (5 types of treatments), Start_Date (Date of start of treatment) and Valid (Final date of the treatment). For example the Treatment_ID 0031 has two observations because this started on May 2011 and finished on June 2011. Then a new treatment 0045started on February 2012 and finished on March 2012 (Only one observation). The same structure applies for all the groups inside DF. I need to compute the difference of months between each treatment and behind each treatment using some conditions. I will use the first two treatments to show this:

  Treatment_ID Start_Date      Valid
1         0031 2011-05-01 2011-05-30
2         0031 2011-05-01 2011-06-30
3         0045 2012-02-01 2012-03-01
4         0057 2012-04-01 2012-04-30

For this example, I have two rows for the first treatment where the Treatment_Id variable is equal. When that occurs the difference in months for the variable Valid must be calculated. When a new treatment appears then the difference in months between Start_Date and Valid must be calculated. Notice that when a treatment has more than one observation the difference is gotten by using Valid variable for the observations in that group but when the Treatment_Id changes therefore the difference must be obtained by using Start_Date and Valid variables. To get this variable Break_Months I used the next structure:

DF$Break_Months=NA

for(i in c(2:(length(DF$Break_Months))))
{
  DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),
                            round(as.numeric(DF$Start_Date[i]-DF$Valid[i-1])/30,0))
}

This for when the Treatment_Id are equal computes the difference between the actual row and the previous with Valid variable and when they are different the difference is computed using Start_Date and Valid. The first value of Break_Months is NA because there is not a previous value to compare. The problem appeared in the end of DF when I used the previous lines of code.

   Treatment_ID Start_Date      Valid Break_Months
47         0098 2012-10-01 2016-07-30            1
48         0098 2012-10-01 2016-08-31            1
49         0031 2016-09-01 2016-09-30            0
50         0031 2016-09-01 2016-10-30            1
51         0031 2016-09-01 2016-11-30            1
52         0031 2016-09-01 2016-12-30            1
53         0031 2016-09-01 2017-01-30            1
54         0031 2016-09-01 2017-03-02            1
55         0031 2016-09-01 2017-03-30            1
56         0012 2017-03-01 2017-03-30           -1

The Treatment_Id 0012 has only one observation because it is new and the Valid date is the same that the last observation of treatment 0031. Because the Treatment_Id 0031 has been used in other months then difference is computed with the previous observations inside the treatment. In the case of 0012 this is not possible because the Valid date of this is the same that the last observation of 0031 and 0012 does not have more observations because it is new. When that happens the comparison must be done with the last observation of the group previous to 0031 this is 0098. By using the concept due to 0012 is not equal to 0098 the Break_Months is computed by the difference between 2017-03-01 (Start_Date) and 2016-08-31 (Valid) giving a value of 6 by the same mechanic in the for structure and not -1.

My question is related to how to incorporate this consideration inside the for. It has been very complex to try to do this because I do not know how to integrate a comparison related to the date (if they are equal as in he example) and looking for the previous group before the one that contains the same date. I have tried using lag function from dplyr package to avoid the for but the results are not the same. The dput() version of DF is the next:

DF<-structure(list(Treatment_ID = c("0031", "0031", "0045", "0057", 
"0057", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0098", "0098", "0098", "0098", 
"0098", "0098", "0098", "0098", "0031", "0031", "0031", "0031", 
"0031", "0031", "0031", "0012"), Start_Date = structure(c(1304208000, 
1304208000, 1328054400, 1333238400, 1333238400, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 
1349049600, 1349049600, 1349049600, 1349049600, 1349049600, 1472688000, 
1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 1472688000, 
1488326400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Valid = structure(c(1306713600, 1309392000, 1330560000, 1335744000, 
    1338336000, 1351555200, 1354233600, 1356825600, 1359504000, 
    1362182400, 1364601600, 1367280000, 1369872000, 1372550400, 
    1375142400, 1377820800, 1380499200, 1383091200, 1385769600, 
    1388361600, 1391040000, 1393718400, 1396137600, 1398816000, 
    1401408000, 1404086400, 1412035200, 1414627200, 1417305600, 
    1419897600, 1422576000, 1425254400, 1427673600, 1432944000, 
    1435622400, 1440892800, 1443571200, 1446163200, 1448841600, 
    1451433600, 1454112000, 1456790400, 1459296000, 1461974400, 
    1464566400, 1467244800, 1469836800, 1472601600, 1475193600, 
    1477785600, 1480464000, 1483056000, 1485734400, 1488412800, 
    1490832000, 1490832000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Treatment_ID", 
"Start_Date", "Valid"), row.names = c(NA, -56L), class = "data.frame")

Thanks for your help.

Update One of these solutions worked perfect. Now I have a little issue when I have to compute a similar variables. First I compute the variable Elapsed that is the difference between Valid and Start_Date. I use the next code:

DF$Elapsed=round(as.numeric(DF$Valid-DF$Start_Date)/30,0)

Then, the dilemma appears. I have to compute the next two variables Last1 and Last2. I use the next code for this:

#Compute Last1
DF$Last1=NA
DF$Last1[1]=0
for(j in c(2:length(DF$Last1)))
{
  DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1],
                     ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[j-1],0))
}

The code worked parcially because I have a similar problem related to the variable Break_Months. In this case because 0031 and 0012 have the same Valid value the comparison of Treatment_Id must not be done with the last value of 0031 where 7 is assigned due to the logic of the loop (variable Elapsed). In this case the correct value is 48 because the comparison must be structured with the last observation of group 0098 then we get the 48. I have tried to modify with last_obs_index but I can not get the correct result.

   Treatment_ID Start_Date      Valid Break_Months Elapsed Last1
47         0098 2012-10-01 2016-07-30            1      47     2
48         0098 2012-10-01 2016-08-31            1      48     2
49         0031 2016-09-01 2016-09-30            0       1    48
50         0031 2016-09-01 2016-10-30            1       2    48
51         0031 2016-09-01 2016-11-30            1       3    48
52         0031 2016-09-01 2016-12-30            1       4    48
53         0031 2016-09-01 2017-01-30            1       5    48
54         0031 2016-09-01 2017-03-02            1       6    48
55         0031 2016-09-01 2017-03-30            1       7    48
56         0012 2017-03-01 2017-03-30            6       1     7

For the variable Last2 I use the next code:

#Compute Last2
DF$Last2=NA
DF$Last2[1]=0
for(k in c(2:length(DF$Last2)))
{
  DF$Last2[k]=ifelse(DF$Treatment_ID[k]==DF$Treatment_ID[k-1],DF$Last2[k-1],
                     ifelse(DF$Treatment_ID[k]!=DF$Treatment_ID[k-1],DF$Break_Months[k],0))
}

In this case it seems that worked but that is not true. Despite the fact that 6 is correct, the comparison is not well defined because 0012 and 0031 has the same Valid date and the optimal comparison is using the last observation of 0098 group.Therefore, the value of Break_Months is assigned. Again I could not fix the loop with the proper logic defined with last_obs_index.

   Treatment_ID Start_Date      Valid Break_Months Elapsed Last1 Last2
47         0098 2012-10-01 2016-07-30            1      47     2     4
48         0098 2012-10-01 2016-08-31            1      48     2     4
49         0031 2016-09-01 2016-09-30            0       1    48     0
50         0031 2016-09-01 2016-10-30            1       2    48     0
51         0031 2016-09-01 2016-11-30            1       3    48     0
52         0031 2016-09-01 2016-12-30            1       4    48     0
53         0031 2016-09-01 2017-01-30            1       5    48     0
54         0031 2016-09-01 2017-03-02            1       6    48     0
55         0031 2016-09-01 2017-03-30            1       7    48     0
56         0012 2017-03-01 2017-03-30            6       1     7     6

Thanks for all the help this time, would it be possible to get advice about how to adapt the loops to get the comparisons in the right way.

johnckane · Accepted Answer

To do this in a for loop you'll need to add an additional criteria to your calculation when the value of Treatment_ID is different from the previous observation.

If the value of Treatment_ID is not the same as the previous observation, compute the difference between Start_Date for this observation and the value of Valid for the last observation of the most recent Treatment_ID where the last value of Valid is also different.

To do this you need to know the indices of DF where the values of Treatment_ID change and the values of Valid change. You'll need the Lag function from Hmisc

library(Hmisc)
new_obs_index <- which(DF$Treatment_ID != Lag(DF$Treatment_ID,1) & DF$Valid != Lag(DF$Valid))

This provides the index of where the new observations start, we actually want the index of the last observation for the last Treatment_ID prior to this.

last_obs_index <- new_obs_index - 1

These are the indices of the values of Valid that meet the criteria of last observation for Treatment_ID such that the value of Valid changes in the next observation as well.

Then in the for loop when the value of Treatment_ID changes subtract the difference between Start and the most recent value of Valid that meets our criteria. We accomplish this by specifying we want

DF$Valid[last_obs_index[max(which(last_obs_index < i))]]

So the for loop looks like this:

for(i in c(2:(length(DF$Break_Months)))){
  DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],
round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),round(as.numeric(DF$Start_Date[i]-DF$Valid[last_obs_index[max(which(last_obs_index < i))]])/30,0))
}

This gives the desired result in the last observation in DF.

   Treatment_ID Start_Date      Valid Break_Months
51         0031 2016-09-01 2016-11-30            1
52         0031 2016-09-01 2016-12-30            1
53         0031 2016-09-01 2017-01-30            1
54         0031 2016-09-01 2017-03-02            1
55         0031 2016-09-01 2017-03-30            1
56         0012 2017-03-01 2017-03-30            6

So altogether, the necessary code to implement is

library(Hmisc)
new_obs_index <- which(DF$Treatment_ID != Lag(DF$Treatment_ID,1) & DF$Valid != Lag(DF$Valid))
last_obs_index <- new_obs_index - 1 
for(i in c(2:(length(DF$Break_Months)))){
DF$Break_Months[i]=ifelse(DF$Treatment_ID[i]==DF$Treatment_ID[i-1],round(as.numeric(DF$Valid[i]-DF$Valid[i-1])/30,0),round(as.numeric(DF$Start_Date[i]-DF$Valid[last_obs_index[max(which(last_obs_index < i))]])/30,0))
}

UPDATE For the variable Last1 you can access the value you desire using the vector last_obs_index using this syntax:

for(j in c(2:length(DF$Last1))){
DF$Last1[j]=ifelse(DF$Treatment_ID[j]==DF$Treatment_ID[j-1],DF$Last1[j-1],ifelse(DF$Treatment_ID[j]!=DF$Treatment_ID[j-1],DF$Elapsed[last_obs_index[max(which(last_obs_index < i))]],0))}

For the variable Last2 if I understand you correctly I think your implementation will always give the answer you want. I think a new treatment value is sufficient to use the value of Break_Months and you don't also need a different value of Valid to use it.

Compute date difference between rows considering different groups inside dataframe

Answers (2)

Related Questions