Taylor Spaulding
Taylor Spaulding

Reputation: 169

How to split event intervals by month with dplyr

I have a set of events identified by an individual "Tag" their location "location" the Start of the event "StartDateTime_UTC" and the End of the event "EndDateTime_UTC"

First 50 rows: (I'll place the full dput down below)

   Tag     location   StartDateTime_UTC   EndDateTime_UTC    
   <chr>   <fct>      <dttm>              <dttm>             
 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19
 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31
 4 5004.24 Unresolved 2014-11-04 10:05:31 2014-11-04 10:08:06
 5 5004.24 IN         2014-11-04 10:08:06 2014-11-12 10:50:28
 6 5004.24 Unresolved 2014-11-12 10:50:28 2014-11-12 10:51:33
 7 5004.24 OUT        2014-11-12 10:51:33 2014-12-24 04:20:28
 8 5004.24 Unresolved 2014-12-24 04:20:28 2014-12-24 04:31:27
 9 5004.24 IN         2014-12-24 04:31:27 2015-02-08 22:30:27
10 5004.24 Unresolved 2015-02-08 22:30:27 2015-02-08 22:31:31
11 5004.24 OUT        2015-02-08 22:31:31 2015-03-01 11:18:32
12 5004.24 Unresolved 2015-03-01 11:18:32 2015-03-01 19:34:36
13 5004.24 IN         2015-03-01 19:34:36 2015-03-09 10:11:55
14 5004.24 Unresolved 2015-03-09 10:11:55 2015-03-09 10:48:02
15 5004.24 OUT        2015-03-09 10:48:02 2015-03-23 09:10:25
16 5004.24 Unresolved 2015-03-23 09:10:25 2015-03-23 09:18:07
17 5004.24 IN         2015-03-23 09:18:07 2015-04-09 23:19:26
18 5004.24 Unresolved 2015-04-09 23:19:26 2015-04-09 23:21:40
19 5004.24 OUT        2015-04-09 23:21:40 2015-04-21 20:11:59
20 5004.24 Unresolved 2015-04-21 20:11:59 2015-04-22 16:33:54
21 5004.24 IN         2015-04-22 16:33:54 2015-06-10 09:19:12
22 5010.04 IN         2014-05-09 18:26:00 2015-04-21 18:28:16
23 5011.03 IN         2016-06-17 15:29:00 2016-07-21 14:23:34
24 5011.07 IN         2014-06-05 17:58:00 2014-12-08 15:15:07
25 5017.06 IN         2013-09-20 03:00:00 2016-08-18 18:13:04
26 5025.22 IN         2013-06-11 14:20:00 2013-06-11 14:20:01
27 5025.26 IN         2016-12-29 22:38:00 2018-01-08 16:51:42
28 5032.24 IN         2014-07-18 18:04:00 2015-06-13 12:44:02
29 5038.04 IN         2014-12-16 17:51:07 2015-04-10 23:28:12
30 5039.03 IN         2016-04-01 17:26:00 2016-07-02 04:39:15
31 5039.07 IN         2014-04-08 19:28:04 2014-04-08 19:53:06
32 5045    IN         2013-08-07 17:00:00 2015-04-08 18:28:43
33 5053.26 IN         2016-11-14 18:12:00 2017-12-24 20:21:31
34 5066.04 IN         2014-02-25 21:32:00 2014-02-25 21:32:01
35 5067.07 IN         2014-10-02 16:39:00 2014-12-08 16:23:50
36 5073.06 IN         2013-05-22 14:45:00 2013-05-22 14:45:01
37 5074.16 IN         2016-02-11 19:02:00 2016-03-09 18:21:45
38 5074.16 Unresolved 2016-03-09 18:21:45 2016-03-09 19:50:22
39 5074.16 OUT        2016-03-09 19:50:22 2016-03-12 23:26:45
40 5088.11 IN         2015-06-02 16:29:00 2015-08-08 11:07:46
41 5094.04 IN         2014-10-17 16:04:00 2014-10-31 16:39:28
42 5101.06 IN         2013-09-20 03:00:00 2013-09-20 03:00:01
43 5101.06 Unresolved 2013-09-20 03:00:01 2014-02-03 07:15:08
44 5101.06 OUT        2014-02-03 07:15:08 2014-04-17 12:15:34
45 5116.24 IN         2014-03-07 20:59:00 2015-03-07 22:21:52
46 5123.03 IN         2016-01-14 18:55:00 2016-03-09 19:21:20
47 5123.03 Unresolved 2016-03-09 19:21:20 2016-03-09 20:24:36
48 5123.03 OUT        2016-03-09 20:24:36 2016-03-09 21:51:51
49 5123.07 IN         2014-01-21 18:54:00 2014-05-20 15:23:09
50 5150.04 IN         2014-05-22 16:17:00 2016-03-07 14:10:00

What I'd like to do is divide the total time for each event into how much time each individual spent at each location (IN, OUT, or Unresolved), for each month in each year.

For example the first 3 rows above would become:

Tag        location    Month    Year   Duration_hrs
5004.24    IN          9        2014   18.50
5004.24    IN          10       2014   328.20
5004.24    Unresolved  10       2014   .06
5004.24    OUT         10       2014   345.74
5004.24    OUT         11       2014   82.09

I'm not familiar with a method for this yet. Is there a way to do this in dplyr or another package which would be useful?

dput:

structure(list(Tag = c("5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5004.24", 
"5004.24", "5004.24", "5004.24", "5004.24", "5004.24", "5010.04", 
"5011.03", "5011.07", "5017.06", "5025.22", "5025.26", "5032.24", 
"5038.04", "5039.03", "5039.07", "5045", "5053.26", "5066.04", 
"5067.07", "5073.06", "5074.16", "5074.16", "5074.16", "5088.11", 
"5094.04", "5101.06", "5101.06", "5101.06", "5116.24", "5123.03", 
"5123.03", "5123.03", "5123.07", "5150.04", "5157.06", "5158.16", 
"5165.22", "5172.24", "5178.31", "5179.03", "5179.07", "5186.16", 
"5186.16", "5186.16", "5186.16", "5186.16", "5186.16", "5186.16", 
"5186.16", "5186.16", "5186.16", "5186.16", "5186.16", "5186.16", 
"5186.16", "5186.16", "5200.11", "5206.31", "5214.16", "5228.24", 
"5235", "5242.16", "5242.16", "5242.16", "5242.16", "5242.16", 
"5249", "5249", "5249", "5256.24", "5256.24", "5256.24", "5256.24", 
"5256.24", "5256.24", "5256.24", "5263.03", "5263.07", "5270.16", 
"5284.11", "5290.31", "5290.31", "5290.31", "5290.31", "5298.16", 
"5298.16", "5298.16", "5298.16", "5298.16", "5312.11", "5318.04", 
"5325.06", "5326.16", "5326.16", "5326.16", "5326.16", "5326.16", 
"5326.16", "5340.11", "5346.31", "5347.07", "5347.07", "5347.07", 
"5347.07", "5347.07", "5347.07", "5347.07", "5347.07", "5347.07", 
"5347.07", "5347.07", "5347.07", "5347.07", "5361.26", "5368.24", 
"5374.04", "5374.04", "5374.04", "5374.04", "5374.04", "5375.03", 
"5375.07", "5381.06", "5402.31", "5402.31", "5402.31", "5403.07", 
"5403.07", "5403.07", "5403.07", "5403.07", "5403.07", "5403.07", 
"5431.07", "5438.16", "5445.26", "5445.26", "5445.26", "5452.11", 
"5465.06", "5465.06", "5465.06", "5480.24", "5487.03", "5487.03", 
"5487.03", "5487.07", "5487.07", "5487.07", "5487.07", "5487.07", 
"5487.07", "5487.07", "5487.07", "5487.07", "5487.07", "5487.07", 
"5493.06", "5493.06", "5493.06", "5493.06", "5493.06", "5493.06", 
"5493.06", "5493.06", "5493.06", "5501.22", "5508.11", "5514.31", 
"5536.11", "5542.31", "5542.31", "5542.31", "5550.16", "5557.22", 
"5564.24", "5570.04", "5571.03", "5571.07", "5577", "5585.26", 
"5592.11", "5599.15", "5599.15"), location = structure(c(1L, 
3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 
3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 2L, 3L, 1L, 
3L, 2L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 
1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 
2L, 3L, 1L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 
2L, 1L, 1L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 
1L, 1L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 
3L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 3L, 
2L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 
1L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("IN", "OUT", "Unresolved"
), class = "factor"), StartDateTime_UTC = structure(c(1412055000, 
1413555139, 1413555343, 1415095531, 1415095686, 1415789428, 1415789493, 
1419394828, 1419395487, 1423434627, 1423434691, 1425208712, 1425238476, 
1425895915, 1425898082, 1427101825, 1427102287, 1428621566, 1428621700, 
1429647119, 1429720434, 1399659960, 1466177340, 1401991080, 1379646000, 
1370960400, 1483051080, 1405706640, 1418752267, 1459531560, 1396985284, 
1375894800, 1479147120, 1393363920, 1412267940, 1369233900, 1455217320, 
1457547705, 1457553022, 1433262540, 1413561840, 1379646000, 1379646001, 
1391411708, 1394225940, 1452797700, 1457551280, 1457555076, 1390330440, 
1400775420, 1373994000, 1477593060, 1374166980, 1401299940, 1447954320, 
1456854360, 1393888860, 1472234940, 1489406928, 1489519785, 1493754358, 
1493768930, 1493920595, 1495185477, 1495192085, 1496331260, 1496332668, 
1502330689, 1504055913, 1504056565, 1504471969, 1504484502, 1446829680, 
1450458660, 1478808060, 1391558460, 1375894800, 1483040340, 1489421085, 
1489524492, 1500821550, 1500840149, 1379430000, 1397144701, 1397145001, 
1418064072, 1425499236, 1425499420, 1426550367, 1426705419, 1456257069, 
1456260660, 1471541820, 1407518280, 1477072440, 1443805740, 1425491220, 
1426570277, 1445317719, 1445333300, 1452196320, 1464295591, 1464296126, 
1465051334, 1465057497, 1438967700, 1398371160, 1380208260, 1472238360, 
1473868480, 1473870631, 1500912421, 1500916070, 1501200103, 1450372080, 
1429632420, 1417450780, 1425319859, 1425319875, 1425476848, 1425477093, 
1425490129, 1425490424, 1425497300, 1425497428, 1425497915, 1425499161, 
1426058600, 1426059054, 1458149820, 1416942495, 1399653120, 1411058361, 
1411058655, 1425581564, 1425582554, 1463157660, 1399655460, 1373994000, 
1423859700, 1426434141, 1426438127, 1418054706, 1423434226, 1423434240, 
1424431972, 1424432003, 1425225584, 1425226167, 1413566460, 1463156700, 
1457457262, 1464341296, 1464345439, 1438964700, 1385483040, 1392229993, 
1392230720, 1397856480, 1454698803, 1457554029, 1457557132, 1392415980, 
1394995835, 1394996053, 1394999500, 1394999554, 1398891904, 1398892060, 
1400546392, 1410656812, 1410657169, 1412681302, 1383667200, 1394063203, 
1394064827, 1394534414, 1394546963, 1394903943, 1394904559, 1404290854, 
1404295213, 1374166980, 1450460220, 1444934640, 1443814080, 1423862520, 
1425398953, 1425403026, 1482517054, 1382107500, 1406303160, 1401292020, 
1459530420, 1397852580, 1375806600, 1459355400, 1436547420, 1434128700, 
1455181152), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    EndDateTime_UTC = structure(c(1413555139, 1413555343, 1415095531, 
    1415095686, 1415789428, 1415789493, 1419394828, 1419395487, 
    1423434627, 1423434691, 1425208712, 1425238476, 1425895915, 
    1425898082, 1427101825, 1427102287, 1428621566, 1428621700, 
    1429647119, 1429720434, 1433927952, 1429640896, 1469111014, 
    1418051707, 1471543984, 1370960401, 1515430302, 1434199442, 
    1428708492, 1467434355, 1396986786, 1428517723, 1514146891, 
    1393363921, 1418055830, 1369233901, 1457547705, 1457553022, 
    1457825205, 1439032066, 1414773568, 1379646001, 1391411708, 
    1397736934, 1425766912, 1457551280, 1457555076, 1457560311, 
    1400599389, 1457359800, 1499495745, 1477593061, 1374166981, 
    1401396204, 1447954321, 1462853353, 1402068722, 1489406928, 
    1489519785, 1493754358, 1493768930, 1493915356, 1495185477, 
    1495192085, 1496331260, 1496332668, 1502328647, 1504055913, 
    1504056565, 1504471969, 1504484502, 1510842039, 1449871257, 
    1467438121, 1489417066, 1399047138, 1402529845, 1489421085, 
    1489524492, 1500821550, 1500840149, 1501331232, 1397144701, 
    1397145001, 1397146277, 1425499236, 1425499420, 1426550367, 
    1426550808, 1456257069, 1456260660, 1457306762, 1471541821, 
    1407522783, 1489406240, 1447974368, 1426563298, 1445317719, 
    1445333300, 1460371267, 1464295591, 1464296126, 1465051334, 
    1465057497, 1475823163, 1438969110, 1429569640, 1380208261, 
    1473868480, 1473870631, 1500912421, 1500916070, 1501192164, 
    1522212002, 1455503331, 1429632421, 1425319859, 1425319875, 
    1425476848, 1425477093, 1425490129, 1425490424, 1425497300, 
    1425497428, 1425497915, 1425499161, 1426058600, 1426059054, 
    1435973349, 1518286934, 1424550436, 1411058361, 1411058655, 
    1425581564, 1425582554, 1448200301, 1467374409, 1411930669, 
    1485696012, 1426434141, 1426438127, 1426439051, 1423434226, 
    1423434240, 1424431972, 1424432003, 1425225584, 1425226167, 
    1426387273, 1414861317, 1465669005, 1464341296, 1464345439, 
    1464345554, 1438964701, 1392229993, 1392230720, 1402254843, 
    1403572223, 1457554029, 1457557132, 1457575616, 1394995835, 
    1394996053, 1394999500, 1394999554, 1398891904, 1398892060, 
    1400546392, 1410656812, 1410657169, 1412681302, 1412892331, 
    1394063203, 1394064827, 1394534414, 1394546963, 1394903943, 
    1394904559, 1404290854, 1404295213, 1414516048, 1374166981, 
    1450460221, 1445287124, 1451876530, 1425398953, 1425403026, 
    1433699058, 1503067936, 1394903394, 1406514937, 1411706590, 
    1460054294, 1413725736, 1375806601, 1485062993, 1448650763, 
    1455181152, 1455181271), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))

Upvotes: 1

Views: 524

Answers (1)

Mikko Marttila
Mikko Marttila

Reputation: 11908

In these situations, I like to write a little helper function that takes a start and an end time, and splits that into a number of smaller periods at a given unit size:

library(tidyverse)
library(lubridate)

split_interval <- function(start, end, unit) {
  breaks <- seq(floor_date(start, unit), ceiling_date(end, unit), by = unit)
  timeline <- c(start, breaks[breaks > start & breaks < end], end)
  tibble(.start = head(timeline, -1), .end = tail(timeline, -1))
}

split_interval(now(), now() + hours(2), unit = "hours")
#> # A tibble: 3 x 2
#>   .start              .end               
#>   <dttm>              <dttm>             
#> 1 2019-11-08 23:33:18 2019-11-09 00:00:00
#> 2 2019-11-09 00:00:00 2019-11-09 01:00:00
#> 3 2019-11-09 01:00:00 2019-11-09 01:33:18

With this in hand, we can split the intervals on each row, giving a list column containing data frames of periods that each row falls into:

tbl_split <- tbl %>% 
  mutate(periods = map2(StartDateTime_UTC, EndDateTime_UTC,
                        split_interval, unit = "months"))

tbl_split
#> # A tibble: 3 x 5
#>   Tag     location   StartDateTime_UTC   EndDateTime_UTC     periods       
#>   <chr>   <fct>      <dttm>              <dttm>              <list>        
#> 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19 <tibble [2 x ~
#> 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43 <tibble [1 x ~
#> 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31 <tibble [2 x ~

(I'm only using the first three rows in tbl to keep the output small. Code at the end.)

Then, we can unnest() to create a longer data frame with each original row now split into multiple rows of shorter periods:

tbl_split_long <- tbl_split %>% unnest(periods)

tbl_split_long %>% 
  select(location, .start, .end)
#> # A tibble: 5 x 3
#>   location   .start              .end               
#>   <fct>      <dttm>              <dttm>             
#> 1 IN         2014-09-30 05:30:00 2014-10-01 00:00:00
#> 2 IN         2014-10-01 00:00:00 2014-10-17 14:12:19
#> 3 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
#> 4 OUT        2014-10-17 14:15:43 2014-11-01 00:00:00
#> 5 OUT        2014-11-01 00:00:00 2014-11-04 10:05:31

Now we can use the start of each period to determine the year and month, and then it’s just a matter of a groupped summary to get the end result:

tbl_split_long %>% 
  group_by(Tag, location, year = year(.start), month = month(.start)) %>% 
  summarize(duration = sum(difftime(.end, .start, units = "hours")))
#> # A tibble: 5 x 5
#> # Groups:   Tag, location, year [3]
#>   Tag     location    year month duration          
#>   <chr>   <fct>      <dbl> <dbl> <drtn>            
#> 1 5004.24 IN          2014     9  18.50000000 hours
#> 2 5004.24 IN          2014    10 398.20527778 hours
#> 3 5004.24 OUT         2014    10 345.73805556 hours
#> 4 5004.24 OUT         2014    11  82.09194444 hours
#> 5 5004.24 Unresolved  2014    10   0.05666667 hours

Data:

tbl <- structure(
  list(
    Tag = c("5004.24", "5004.24", "5004.24"),
    location = structure(
      c(1L, 3L, 2L),
      .Label = c("IN", "OUT", "Unresolved"),
      class = "factor"
    ),
    StartDateTime_UTC = structure(
      c(1412055000, 1413555139, 1413555343),
      class = c("POSIXct", "POSIXt"),
      tzone = "UTC"
    ),
    EndDateTime_UTC = structure(
      c(1413555139, 1413555343, 1415095531),
      class = c("POSIXct", "POSIXt"),
      tzone = "UTC"
    )
  ),
  row.names = c(NA, 3L),
  class = c("tbl_df", "tbl", "data.frame")
)

Upvotes: 2

Related Questions