Reputation: 311
This data relates to people performing two activities with different time stamps. Let me explain the data in detail.
I have two tables: one with the times people go for an exercise, and another with the times they have taken their blood pressure and how many times they have done so (frequency).
Table 1: the first column is the id of the person and the second column is the date & time of the exercise.
id date_time
1 29-Oct-2016 6:01:03 PM
1 29-Oct-2016 7:34:17 PM
1 30-Oct-2016 2:08:03 PM
1 30-Oct-2016 3:55:12 PM
2 31-Oct-2016 11:32:12 AM
2 31-Oct-2016 2:59:56 PM
2 01-Nov-2016 12:49:44 PM
2 01-Nov-2016 1:55:16 PM
2 01-Nov-2016 7:18:22 PM
2 01-Nov-2016 8:40:48 PM
3 01-Nov-2016 9:19:50 PM
3 02-Nov-2016 2:20:15 PM
3 02-Nov-2016 6:52:27 PM
3 02-Nov-2016 7:39:32 PM
3 03-Nov-2016 8:55:41 AM
Table 2: It consists of the id and the date & time of each blood pressure test administered. The last column is the cumulative frequency of the tests taken so far.
id date_time Cumulative_frequency
1 30-Oct-2016 2:07:03 PM 2
1 30-Oct-2016 3:56:12 PM 3
2 31-Oct-2016 11:31:12 AM 5
2 31-Oct-2016 3:00:56 PM 7
3 02-Nov-2016 7:40:32 PM 0
3 03-Nov-2016 8:54:41 AM 5
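For reference, the two tables can be reproduced in R as follows (the names df1 and df2 are illustrative):
# rebuild Table 1 and Table 2 as data frames
df1 <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  date_time = c('29-Oct-2016 6:01:03 PM', '29-Oct-2016 7:34:17 PM',
                '30-Oct-2016 2:08:03 PM', '30-Oct-2016 3:55:12 PM',
                '31-Oct-2016 11:32:12 AM', '31-Oct-2016 2:59:56 PM',
                '01-Nov-2016 12:49:44 PM', '01-Nov-2016 1:55:16 PM',
                '01-Nov-2016 7:18:22 PM', '01-Nov-2016 8:40:48 PM',
                '01-Nov-2016 9:19:50 PM', '02-Nov-2016 2:20:15 PM',
                '02-Nov-2016 6:52:27 PM', '02-Nov-2016 7:39:32 PM',
                '03-Nov-2016 8:55:41 AM'),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  date_time = c('30-Oct-2016 2:07:03 PM', '30-Oct-2016 3:56:12 PM',
                '31-Oct-2016 11:31:12 AM', '31-Oct-2016 3:00:56 PM',
                '02-Nov-2016 7:40:32 PM', '03-Nov-2016 8:54:41 AM'),
  Cumulative_frequency = c(2, 3, 5, 7, 0, 5),
  stringsAsFactors = FALSE)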
I have to add the cumulative_frequency data from Table 2 to Table 1. What is needed is, for each id, to know how many times blood pressure has already been taken before each exercise. The following table is the required outcome.
Table 3:
id date_time Cumulative_frequency
1 29-Oct-2016 6:01:03 PM 0
1 29-Oct-2016 7:34:17 PM 0
1 30-Oct-2016 2:08:03 PM 2
1 30-Oct-2016 3:55:12 PM 2
2 31-Oct-2016 11:32:12 AM 5
2 31-Oct-2016 2:59:56 PM 5
2 01-Nov-2016 12:49:44 PM 7
2 01-Nov-2016 1:55:16 PM 7
2 01-Nov-2016 7:18:22 PM 7
2 01-Nov-2016 8:40:48 PM 7
3 01-Nov-2016 9:19:50 PM 0
3 02-Nov-2016 2:20:15 PM 0
3 02-Nov-2016 6:52:27 PM 0
3 02-Nov-2016 7:39:32 PM 0
3 03-Nov-2016 8:55:41 AM 5
Here is what I am trying: first I sort the data frames by date. Then I create two subsets, one from each data set, for each id using a for loop. Then, for each exercise date_time in subset 1, I use which to find the test date_times that are earlier, take the maximum index, and use that index to fetch the frequency. I have then used merge to obtain Table 3. Can you advise, is there a more efficient way to go about it?
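For reference, that approach in sketch form (illustrative names; df1 and df2 are the data frames shown above):
# parse the timestamps and sort the tests so the maximum index is the latest test
ex <- df1; bp <- df2   # work on copies
ex$date_time <- as.POSIXct(ex$date_time, format = '%d-%b-%Y %I:%M:%S %p', tz = 'GMT')
bp$date_time <- as.POSIXct(bp$date_time, format = '%d-%b-%Y %I:%M:%S %p', tz = 'GMT')
bp <- bp[order(bp$id, bp$date_time), ]
ex$Cumulative_frequency <- 0
for (i in seq_len(nrow(ex))) {
  # tests by the same person taken at or before this exercise
  idx <- which(bp$id == ex$id[i] & bp$date_time <= ex$date_time[i])
  if (length(idx) > 0)
    ex$Cumulative_frequency[i] <- bp$Cumulative_frequency[max(idx)]
}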
Upvotes: 1
Views: 114
Reputation: 83215
The rolling join capability of the data.table-package in combination with the cummax-function might be a good approach for this problem:
# convert the 'date_time' variables to date-time class
df1$date_time <- as.POSIXct(df1$date_time, format = '%d-%b-%Y %I:%M:%S %p', tz = 'GMT')
df2$date_time <- as.POSIXct(df2$date_time, format = '%d-%b-%Y %I:%M:%S %p', tz = 'GMT')
# load the 'data.table'-package and convert the data.frame-s to data.table-s
library(data.table)
setDT(df1)
setDT(df2)
# several steps to get the desired result:
# rolling join (line 1)
# replace missing frequency values with zero (line 2)
# get the cumulative max by 'id' (line 3)
df1[df2, on = .(id, date_time), roll = -Inf, c_freq := Cumulative_frequency
][is.na(c_freq), c_freq := 0
][, c_freq := cummax(c_freq), id][]
this gives:
    id           date_time c_freq
 1:  1 2016-10-29 18:01:03      0
 2:  1 2016-10-29 19:34:17      0
 3:  1 2016-10-30 14:08:03      2
 4:  1 2016-10-30 15:55:12      2
 5:  2 2016-10-31 11:32:12      5
 6:  2 2016-10-31 14:59:56      5
 7:  2 2016-11-01 12:49:44      7
 8:  2 2016-11-01 13:55:16      7
 9:  2 2016-11-01 19:18:22      7
10:  2 2016-11-01 20:40:48      7
11:  3 2016-11-01 21:19:50      0
12:  3 2016-11-02 14:20:15      0
13:  3 2016-11-02 18:52:27      0
14:  3 2016-11-02 19:39:32      0
15:  3 2016-11-03 08:55:41      5
An alternative which doesn't need cummax (thanks to @DavidArenburg):
df1[, c_freq := df2[df1, Cumulative_frequency, on = .(id, date_time), roll = Inf]
][is.na(c_freq), c_freq := 0][]
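To illustrate what roll = Inf does in that lookup, here is a toy example (rows made up for illustration): for each row of the query table it picks the most recent preceding observation in the lookup table within the same id:
library(data.table)
# toy lookup: two tests for one person
lookup <- data.table(id = 1L,
                     date_time = as.POSIXct(c('2016-10-30 14:07:03',
                                              '2016-10-30 15:56:12'), tz = 'GMT'),
                     Cumulative_frequency = c(2, 3))
# one exercise between the two tests
query <- data.table(id = 1L,
                    date_time = as.POSIXct('2016-10-30 15:55:12', tz = 'GMT'))
# roll = Inf carries the last preceding observation forward, so the
# 14:07:03 test (frequency 2) is matched to the 15:55:12 exercise
lookup[query, Cumulative_frequency, on = .(id, date_time), roll = Inf]
# [1] 2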
Upvotes: 3
Reputation: 740
You can first split the second data frame by id into a list. For each exercise, first subset the list for the correct id and then, using findInterval, get the closest preceding match for the date in the exercise data frame:
times1 <- strptime(table1[,2], "%d-%b-%Y %I:%M:%S %p")
times2 <- strptime(table2[,2], "%d-%b-%Y %I:%M:%S %p")
table1a <- data.frame(table1[,1], times1)
names(table1a) <- c("id", "date_time")
table2a <- data.frame(table2[,1], times2, table2[,3])
names(table2a) <- c("id", "date_time", "cum_freq")
table2list <- split(table2a, table2a[,1])
getcumul <- function(i)
{
  x <- table1a[i, ]                          # one exercise row
  df1 <- table2list[[as.character(x$id)]]    # tests for this id (indexed by name, so non-consecutive ids also work)
  # index of the last test at or before the exercise time (0 if none)
  cutt <- findInterval(x$date_time, df1[,2])
  if (cutt == 0) return(0) else
  {
    cumul <- df1[cutt, 3]
    return(cumul)
  }
}
cumuls <- function() sapply(1:nrow(table1a), getcumul)
With the split list, the whole data frame is not subset on every iteration, and with findInterval there is no need to subset the whole data frame for the frequencies again.
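A toy illustration of that findInterval behaviour: it returns, for each query time, the index of the last test time that is not later, or 0 if no test precedes it:
tests <- as.POSIXct(c('2016-10-30 14:07:03', '2016-10-30 15:56:12'), tz = 'GMT')
# an exercise between the two tests maps to index 1; one before both maps to 0
findInterval(as.POSIXct(c('2016-10-30 15:55:12', '2016-10-29 18:01:03'), tz = 'GMT'),
             tests)
# [1] 1 0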
The cumuls function returns the same frequencies. Just cbind them to your data frame.
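For example (table3 is just an illustrative name):
# attach the computed frequencies to the exercise table
table3 <- cbind(table1a, Cumulative_frequency = cumuls())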
And why I used sapply instead of apply with margin 1: apply returns a vector with the values coerced to character, while subsetting rows returns a one-row data frame in which the POSIXct class of the second column is retained.
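A quick demonstration of that difference:
# apply converts the data frame to a character matrix, so column classes are lost
class(apply(table1a, 1, function(r) r["date_time"]))
# [1] "character"
# row subsetting returns a one-row data frame; the POSIXct class survives
class(table1a[1, ]$date_time)
# [1] "POSIXct" "POSIXt"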
Upvotes: 1