Reputation: 21
I have a df which consists of three columns, first one contains unique IDs, second one numeric Values and the last one date and time in POSIXct format. See below.
Ive tried to provide the first 10 observations using dput. I hope this works. Note that I had to delete the internal.selfref etc.
> dput(bp2s[1:10])
structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L,
100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L,
120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976682620,
5976684000, 5976687600, 5976691200, 5976694800, 5976698400, 5976785280,
5976788400, 5976790200, 5976792000), tzone = "", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Now I want to check each ID individually, whether at any point the VALUE stays below 100 for at least half an hour. The results should then be displayed in a new column ("BELOW") in Boolean with TRUE for < 100 for half an hour or more and FALSE if not. The Boolean value should be placed in the row of the starting point. The result should then look like this:
HADM_ID VALUE TIME BELOW
1: 100210 92 2159-05-26 08:40:00 TRUE
2: 100210 98 2159-05-26 09:00:00 FALSE
3: 100210 105 2159-05-26 09:12:00 FALSE
4: 100889 92 2166-08-15 14:50:00 FALSE
5: 100889 98 2166-08-15 15:00:00 FALSE
6: 100889 101 2166-08-15 15:15:00 FALSE
7: 100520 89 2133-02-03 14:15:00 TRUE
8: 100520 102 2133-02-03 15:15:00 FALSE
Using the UPDATED tidyverse proposal: Time diff is working and the first time the value stayed < 100 for 30 minutes+ is detected correctly(observation 7). Note, however, the next two observations(8 & 9) are still < 100 and stay < 100 and yet they are marked as FALSE.
HADM_ID VALUE time time_diff grp below
<int> <int> <dttm> <drtn> <int> <lgl>
1 100210 112 2159-05-24 15:37:00 23 mins 1 FALSE
2 100210 120 2159-05-24 16:00:00 60 mins 2 FALSE
3 100210 121 2159-05-24 17:00:00 60 mins 3 FALSE
4 100210 112 2159-05-24 18:00:00 60 mins 4 FALSE
5 100210 106 2159-05-24 19:00:00 60 mins 5 FALSE
6 100210 109 2159-05-24 20:00:00 1448 mins 6 FALSE
7 100210 80 2159-05-25 20:08:00 52 mins 6 TRUE
8 100210 89 2159-05-25 21:00:00 30 mins 6 FALSE #should be TRUE since Value stays under 100 for next 30 min
9 100210 85 2159-05-25 21:30:00 30 mins 6 FALSE #should be TRUE see above
10 100210 99 2159-05-25 22:00:00 30 mins 6 FALSE
Using the UPDATED data.table solution I get the following result which looks spot on.
HADM_ID VALUE time timenext grp BELOW
1: 100210 112 2159-05-24 15:37:00 2159-05-24 15:37:00 1 FALSE
2: 100210 120 2159-05-24 16:00:00 2159-05-24 16:00:00 1 FALSE
3: 100210 121 2159-05-24 17:00:00 2159-05-24 17:00:00 1 FALSE
4: 100210 112 2159-05-24 18:00:00 2159-05-24 18:00:00 1 FALSE
5: 100210 106 2159-05-24 19:00:00 2159-05-24 19:00:00 1 FALSE
6: 100210 109 2159-05-24 20:00:00 2159-05-24 20:00:00 1 FALSE
7: 100210 80 2159-05-25 20:08:00 2159-05-25 21:00:00 2 TRUE
8: 100210 89 2159-05-25 21:00:00 2159-05-25 21:30:00 2 TRUE
9: 100210 85 2159-05-25 21:30:00 2159-05-25 22:00:00 2 TRUE
10: 100210 99 2159-05-25 22:00:00 2159-05-25 22:30:00 2 TRUE
11: 100210 89 2159-05-25 22:30:00 2159-05-25 23:00:00 2 FALSE
12: 100210 102 2159-05-25 23:00:00 2159-05-25 23:00:00 3 FALSE
Regards
Upvotes: 2
Views: 134
Reputation: 30494
This is a revised answer using tidyverse
. The code has been modified to give a consistent result with @r2evans answer.
After grouping by your unique ID, add a column with the next row's time when VALUE
below 100.
Then, create subgroups within each ID, to separate runs of VALUE
less than 100 and those greater than 100.
Finally, for each of these subgroups, make below = TRUE
when the VALUE
is less than 100, and the difference in time between latest time for that subgroup and the row's time remains above 30 minutes.
library(tidyverse)
library(data.table)
df %>%
group_by(HADM_ID) %>%
mutate(time_next = if_else(VALUE >= 100, time, lead(time, default = last(time)))) %>%
group_by(grp = rleid(VALUE < 100), .add = TRUE) %>%
mutate(below = VALUE < 100 & difftime(max(time_next, na.rm = TRUE),
time,
units = "mins") > 30)
Output
HADM_ID VALUE time time_next grp below
<int> <int> <dttm> <dttm> <int> <lgl>
1 100210 92 2159-05-26 08:40:00 2159-05-26 09:00:00 1 TRUE
2 100210 98 2159-05-26 09:00:00 2159-05-26 09:12:00 1 FALSE
3 100210 105 2159-05-26 09:12:00 2159-05-26 09:12:00 2 FALSE
4 100889 92 2166-08-15 14:45:00 2166-08-15 15:00:00 3 FALSE
5 100889 98 2166-08-15 15:00:00 2166-08-15 15:15:00 3 FALSE
6 100889 101 2166-08-15 15:15:00 2166-08-15 15:15:00 4 FALSE
7 100520 89 2133-02-03 14:15:00 2133-02-03 15:15:00 5 TRUE
8 100520 102 2133-02-03 15:15:00 2133-02-03 15:15:00 6 FALSE
For second example dataset in question:
HADM_ID VALUE time time_diff grp below time_next
<int> <int> <dttm> <int> <int> <lgl> <dttm>
1 100210 112 2159-05-24 15:37:00 1380 1 FALSE 2159-05-24 15:37:00
2 100210 120 2159-05-24 16:00:00 3600 1 FALSE 2159-05-24 16:00:00
3 100210 121 2159-05-24 17:00:00 3600 1 FALSE 2159-05-24 17:00:00
4 100210 112 2159-05-24 18:00:00 3600 1 FALSE 2159-05-24 18:00:00
5 100210 106 2159-05-24 19:00:00 3600 1 FALSE 2159-05-24 19:00:00
6 100210 109 2159-05-24 20:00:00 86880 1 FALSE 2159-05-24 20:00:00
7 100210 80 2159-05-25 20:08:00 3120 2 TRUE 2159-05-25 21:00:00
8 100210 89 2159-05-25 21:00:00 1800 2 TRUE 2159-05-25 21:30:00
9 100210 85 2159-05-25 21:30:00 1800 2 FALSE 2159-05-25 22:00:00
10 100210 99 2159-05-25 22:00:00 1800 2 FALSE 2159-05-25 22:00:00
Upvotes: 2
Reputation: 160607
Try this data.table
solution:
library(data.table)
bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100889L, 100889L, 100889L, 100520L, 100520L), VALUE = c(92L, 98L, 105L, 92L, 98L, 101L, 89L, 102L), time = structure(c(5976852000, 5976853200, 5976853920, 6204797100, 6204798000, 6204798900, 5146744500, 5146748100), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -8L)))
bp[, timenext := shift(time, type = "lead", fill = time[.N]),
by = HADM_ID
][, grp := cumsum(VALUE >= 100),
by = .(HADM_ID)
][, BELOW := as.numeric(max(timenext,na.rm=TRUE) - time, units="mins") > 30,
by = .(HADM_ID, grp)
][, c("timenext", "grp") := NULL ]
# HADM_ID VALUE time BELOW
# <int> <int> <POSc> <lgcl>
# 1: 100210 92 2159-05-26 08:40:00 TRUE
# 2: 100210 98 2159-05-26 09:00:00 FALSE
# 3: 100210 105 2159-05-26 09:12:00 FALSE
# 4: 100889 92 2166-08-15 14:45:00 FALSE
# 5: 100889 98 2166-08-15 15:00:00 FALSE
# 6: 100889 101 2166-08-15 15:15:00 FALSE
# 7: 100520 89 2133-02-03 14:15:00 TRUE
# 8: 100520 102 2133-02-03 15:15:00 FALSE
Updated for new data:
bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L, 120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976704220, 5976705600, 5976709200, 5976712800, 5976716400, 5976720000, 5976806880, 5976810000, 5976811800, 5976813600), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -10L)))
bp[, timenext := fifelse(VALUE >= 100, time, shift(time, type = "lead", fill = time[.N])),
by = HADM_ID
][, grp := rleid(VALUE < 100), by = .(HADM_ID)
][, BELOW := VALUE < 100 & as.numeric(max(timenext,na.rm=TRUE) - time, units="mins") > 30,
by = .(HADM_ID, grp)
]
# HADM_ID VALUE time timenext grp BELOW
# <int> <int> <POSc> <POSc> <int> <lgcl>
# 1: 100210 112 2159-05-24 15:37:00 2159-05-24 15:37:00 1 FALSE
# 2: 100210 120 2159-05-24 16:00:00 2159-05-24 16:00:00 1 FALSE
# 3: 100210 121 2159-05-24 17:00:00 2159-05-24 17:00:00 1 FALSE
# 4: 100210 112 2159-05-24 18:00:00 2159-05-24 18:00:00 1 FALSE
# 5: 100210 106 2159-05-24 19:00:00 2159-05-24 19:00:00 1 FALSE
# 6: 100210 109 2159-05-24 20:00:00 2159-05-24 20:00:00 1 FALSE
# 7: 100210 80 2159-05-25 20:08:00 2159-05-25 21:00:00 2 TRUE
# 8: 100210 89 2159-05-25 21:00:00 2159-05-25 21:30:00 2 TRUE
# 9: 100210 85 2159-05-25 21:30:00 2159-05-25 22:00:00 2 FALSE
# 10: 100210 99 2159-05-25 22:00:00 2159-05-25 22:00:00 2 FALSE
Upvotes: 2