Chickenleg Joe
Chickenleg Joe

Reputation: 21

Check Values in columns for certain time span using timestamps from another column in dataframe

I have a df which consists of three columns, first one contains unique IDs, second one numeric Values and the last one date and time in POSIXct format. See below.

Ive tried to provide the first 10 observations using dput. I hope this works. Note that I had to delete the internal.selfref etc.

> dput(bp2s[1:10])
structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L, 
                                 100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L, 
                                                                                                  120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976682620, 
                                                                                                                                                                        5976684000, 5976687600, 5976691200, 5976694800, 5976698400, 5976785280, 
                                                                                                                                                                        5976788400, 5976790200, 5976792000), tzone = "", class = c("POSIXct", 
                                                                                                                                                                                                                                   "POSIXt"))), row.names = c(NA, -10L), class = c("data.table", 
                                                                                                                                                                                                                                                                                   "data.frame"))

Now I want to check each ID individually, whether at any point the VALUE stays below 100 for at least half an hour. The results should then be displayed in a new column ("BELOW") in Boolean with TRUE for < 100 for half an hour or more and FALSE if not. The Boolean value should be placed in the row of the starting point. The result should then look like this:

   HADM_ID VALUE                TIME BELOW
1:  100210    92 2159-05-26 08:40:00 TRUE
2:  100210    98 2159-05-26 09:00:00 FALSE
3:  100210   105 2159-05-26 09:12:00 FALSE
4:  100889    92 2166-08-15 14:50:00 FALSE
5:  100889    98 2166-08-15 15:00:00 FALSE
6:  100889   101 2166-08-15 15:15:00 FALSE
7:  100520    89 2133-02-03 14:15:00 TRUE
8:  100520   102 2133-02-03 15:15:00 FALSE

Using the UPDATED tidyverse proposal: Time diff is working and the first time the value stayed < 100 for 30 minutes+ is detected correctly(observation 7). Note, however, the next two observations(8 & 9) are still < 100 and stay < 100 and yet they are marked as FALSE.

 HADM_ID VALUE time                time_diff   grp below
     <int> <int> <dttm>              <drtn>    <int> <lgl>
 1  100210   112 2159-05-24 15:37:00   23 mins     1 FALSE
 2  100210   120 2159-05-24 16:00:00   60 mins     2 FALSE
 3  100210   121 2159-05-24 17:00:00   60 mins     3 FALSE
 4  100210   112 2159-05-24 18:00:00   60 mins     4 FALSE
 5  100210   106 2159-05-24 19:00:00   60 mins     5 FALSE
 6  100210   109 2159-05-24 20:00:00 1448 mins     6 FALSE
 7  100210    80 2159-05-25 20:08:00   52 mins     6 TRUE 
 8  100210    89 2159-05-25 21:00:00   30 mins     6 FALSE #should be TRUE since Value stays under 100 for next 30 min
 9  100210    85 2159-05-25 21:30:00   30 mins     6 FALSE #should be TRUE see above 
10  100210    99 2159-05-25 22:00:00   30 mins     6 FALSE

Using the UPDATED data.table solution I get the following result which looks spot on.

HADM_ID VALUE                time            timenext grp BELOW
 1:  100210   112 2159-05-24 15:37:00 2159-05-24 15:37:00   1 FALSE
 2:  100210   120 2159-05-24 16:00:00 2159-05-24 16:00:00   1 FALSE
 3:  100210   121 2159-05-24 17:00:00 2159-05-24 17:00:00   1 FALSE
 4:  100210   112 2159-05-24 18:00:00 2159-05-24 18:00:00   1 FALSE
 5:  100210   106 2159-05-24 19:00:00 2159-05-24 19:00:00   1 FALSE
 6:  100210   109 2159-05-24 20:00:00 2159-05-24 20:00:00   1 FALSE
 7:  100210    80 2159-05-25 20:08:00 2159-05-25 21:00:00   2  TRUE
 8:  100210    89 2159-05-25 21:00:00 2159-05-25 21:30:00   2  TRUE
 9:  100210    85 2159-05-25 21:30:00 2159-05-25 22:00:00   2  TRUE
10:  100210    99 2159-05-25 22:00:00 2159-05-25 22:30:00   2  TRUE
11:  100210    89 2159-05-25 22:30:00 2159-05-25 23:00:00   2 FALSE
12:  100210   102 2159-05-25 23:00:00 2159-05-25 23:00:00   3 FALSE

Regards

Upvotes: 2

Views: 134

Answers (2)

Ben
Ben

Reputation: 30494

This is a revised answer using tidyverse. The code has been modified to give a consistent result with @r2evans answer.

After grouping by your unique ID, add a column with the next row's time when VALUE below 100.

Then, create subgroups within each ID, to separate runs of VALUE less than 100 and those greater than 100.

Finally, for each of these subgroups, make below = TRUE when the VALUE is less than 100, and the difference in time between latest time for that subgroup and the row's time remains above 30 minutes.

library(tidyverse)
library(data.table)

df %>%
  group_by(HADM_ID) %>%
  mutate(time_next = if_else(VALUE >= 100, time, lead(time, default = last(time)))) %>%
  group_by(grp = rleid(VALUE < 100), .add = TRUE) %>%
  mutate(below = VALUE < 100 & difftime(max(time_next, na.rm = TRUE), 
                                        time, 
                                        units = "mins") > 30)

Output

  HADM_ID VALUE time                time_next             grp below
    <int> <int> <dttm>              <dttm>              <int> <lgl>
1  100210    92 2159-05-26 08:40:00 2159-05-26 09:00:00     1 TRUE 
2  100210    98 2159-05-26 09:00:00 2159-05-26 09:12:00     1 FALSE
3  100210   105 2159-05-26 09:12:00 2159-05-26 09:12:00     2 FALSE
4  100889    92 2166-08-15 14:45:00 2166-08-15 15:00:00     3 FALSE
5  100889    98 2166-08-15 15:00:00 2166-08-15 15:15:00     3 FALSE
6  100889   101 2166-08-15 15:15:00 2166-08-15 15:15:00     4 FALSE
7  100520    89 2133-02-03 14:15:00 2133-02-03 15:15:00     5 TRUE 
8  100520   102 2133-02-03 15:15:00 2133-02-03 15:15:00     6 FALSE

For second example dataset in question:

   HADM_ID VALUE time                time_diff   grp below time_next          
     <int> <int> <dttm>                  <int> <int> <lgl> <dttm>             
 1  100210   112 2159-05-24 15:37:00      1380     1 FALSE 2159-05-24 15:37:00
 2  100210   120 2159-05-24 16:00:00      3600     1 FALSE 2159-05-24 16:00:00
 3  100210   121 2159-05-24 17:00:00      3600     1 FALSE 2159-05-24 17:00:00
 4  100210   112 2159-05-24 18:00:00      3600     1 FALSE 2159-05-24 18:00:00
 5  100210   106 2159-05-24 19:00:00      3600     1 FALSE 2159-05-24 19:00:00
 6  100210   109 2159-05-24 20:00:00     86880     1 FALSE 2159-05-24 20:00:00
 7  100210    80 2159-05-25 20:08:00      3120     2 TRUE  2159-05-25 21:00:00
 8  100210    89 2159-05-25 21:00:00      1800     2 TRUE  2159-05-25 21:30:00
 9  100210    85 2159-05-25 21:30:00      1800     2 FALSE 2159-05-25 22:00:00
10  100210    99 2159-05-25 22:00:00      1800     2 FALSE 2159-05-25 22:00:00

Upvotes: 2

r2evans
r2evans

Reputation: 160607

Try this data.table solution:

library(data.table)
bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100889L, 100889L, 100889L, 100520L, 100520L), VALUE = c(92L, 98L, 105L, 92L, 98L, 101L, 89L, 102L), time = structure(c(5976852000, 5976853200, 5976853920, 6204797100, 6204798000, 6204798900, 5146744500, 5146748100), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -8L)))

bp[, timenext := shift(time, type = "lead", fill = time[.N]),
   by = HADM_ID
  ][, grp := cumsum(VALUE >= 100),
    by = .(HADM_ID)
  ][, BELOW := as.numeric(max(timenext,na.rm=TRUE) - time, units="mins") > 30,
    by = .(HADM_ID, grp)
  ][, c("timenext", "grp") := NULL ]
#    HADM_ID VALUE                time  BELOW
#      <int> <int>              <POSc> <lgcl>
# 1:  100210    92 2159-05-26 08:40:00   TRUE
# 2:  100210    98 2159-05-26 09:00:00  FALSE
# 3:  100210   105 2159-05-26 09:12:00  FALSE
# 4:  100889    92 2166-08-15 14:45:00  FALSE
# 5:  100889    98 2166-08-15 15:00:00  FALSE
# 6:  100889   101 2166-08-15 15:15:00  FALSE
# 7:  100520    89 2133-02-03 14:15:00   TRUE
# 8:  100520   102 2133-02-03 15:15:00  FALSE

Updated for new data:

bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L, 120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976704220, 5976705600, 5976709200, 5976712800, 5976716400, 5976720000, 5976806880, 5976810000, 5976811800, 5976813600), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -10L)))

bp[, timenext := fifelse(VALUE >= 100, time, shift(time, type = "lead", fill = time[.N])),
   by = HADM_ID
  ][, grp := rleid(VALUE < 100), by = .(HADM_ID)
  ][, BELOW := VALUE < 100 & as.numeric(max(timenext,na.rm=TRUE) - time, units="mins") > 30,
    by = .(HADM_ID, grp)
  ]
#     HADM_ID VALUE                time            timenext   grp  BELOW
#       <int> <int>              <POSc>              <POSc> <int> <lgcl>
#  1:  100210   112 2159-05-24 15:37:00 2159-05-24 15:37:00     1  FALSE
#  2:  100210   120 2159-05-24 16:00:00 2159-05-24 16:00:00     1  FALSE
#  3:  100210   121 2159-05-24 17:00:00 2159-05-24 17:00:00     1  FALSE
#  4:  100210   112 2159-05-24 18:00:00 2159-05-24 18:00:00     1  FALSE
#  5:  100210   106 2159-05-24 19:00:00 2159-05-24 19:00:00     1  FALSE
#  6:  100210   109 2159-05-24 20:00:00 2159-05-24 20:00:00     1  FALSE
#  7:  100210    80 2159-05-25 20:08:00 2159-05-25 21:00:00     2   TRUE
#  8:  100210    89 2159-05-25 21:00:00 2159-05-25 21:30:00     2   TRUE
#  9:  100210    85 2159-05-25 21:30:00 2159-05-25 22:00:00     2  FALSE
# 10:  100210    99 2159-05-25 22:00:00 2159-05-25 22:00:00     2  FALSE

Upvotes: 2

Related Questions