Reputation: 1163
I am trying to transform a dataframe df
consisting of a value
-column, two date columns (start
and end
), and a interval-column (duration
) into the long format by unnesting/unstacking the duration
column.
library(dplyr)
library(lubridate)
df <- data.frame(value = letters[1:3], start = as_date(1:3), end = as_date(3:1)+3) %>%
mutate(duration = interval(start, end))
The expected outcome would a dataframe where value
, start
, and end
are duplicated for each day as defined by duration
. For instance, the value 'a' would appear 6 times each time on a different day (2nd, 3rd, 4th, 5th, 6th, 7th January 1970).
I tried to use the unnest
function from the tidyr
package but nothing happened.
tidyr::unnest(df, duration)
Any help is greatly appreciated :)
Upvotes: 4
Views: 136
Reputation: 26218
with uncount
df %>% uncount(as.integer(duration/(24*60*60)) +1) %>%
group_by(value) %>%
mutate(date = row_number() -1 + start)
# A tibble: 12 x 5
# Groups: value [3]
value start end duration date
<chr> <date> <date> <Interval> <date>
1 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-02
2 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-03
3 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-04
4 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-05
5 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-06
6 a 1970-01-02 1970-01-07 1970-01-02 UTC--1970-01-07 UTC 1970-01-07
7 b 1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-03
8 b 1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-04
9 b 1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-05
10 b 1970-01-03 1970-01-06 1970-01-03 UTC--1970-01-06 UTC 1970-01-06
11 c 1970-01-04 1970-01-05 1970-01-04 UTC--1970-01-05 UTC 1970-01-04
12 c 1970-01-04 1970-01-05 1970-01-04 UTC--1970-01-05 UTC 1970-01-05
Upvotes: 2
Reputation: 21908
You can also use the following solution. Since we are going to create duplicate rows we can wrap the operation in a list and then use unnest_longer
. purrr
package functions have always been my first choice but you can also use this as an alternative.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
group_by(value) %>%
mutate(date = list(start + 0:(duration/ddays(1)))) %>%
unnest_longer(date) %>%
select(-duration)
# A tibble: 12 x 4
# Groups: value [3]
value start end date
<chr> <date> <date> <date>
1 a 1970-01-02 1970-01-07 1970-01-02
2 a 1970-01-02 1970-01-07 1970-01-03
3 a 1970-01-02 1970-01-07 1970-01-04
4 a 1970-01-02 1970-01-07 1970-01-05
5 a 1970-01-02 1970-01-07 1970-01-06
6 a 1970-01-02 1970-01-07 1970-01-07
7 b 1970-01-03 1970-01-06 1970-01-03
8 b 1970-01-03 1970-01-06 1970-01-04
9 b 1970-01-03 1970-01-06 1970-01-05
10 b 1970-01-03 1970-01-06 1970-01-06
11 c 1970-01-04 1970-01-05 1970-01-04
12 c 1970-01-04 1970-01-05 1970-01-05
Upvotes: 3
Reputation: 27732
A data.table
approach
library(data.table)
setDT(df)[, .(date = seq(start, end, by = 1)), by = .(value)]
# value date
# 1: a 1970-01-02
# 2: a 1970-01-03
# 3: a 1970-01-04
# 4: a 1970-01-05
# 5: a 1970-01-06
# 6: a 1970-01-07
# 7: b 1970-01-03
# 8: b 1970-01-04
# 9: b 1970-01-05
#10: b 1970-01-06
#11: c 1970-01-04
#12: c 1970-01-05
Upvotes: 1
Reputation: 2164
You can't unstack a column of intervals and expect it to generate all dates in between, but by using seq
you can generate them yourself. Try this:
library(tidyverse)
library(lubridate)
df %>%
rowwise() %>%
summarise(
value, dates = seq(start, end, by = 1)
)
#> # A tibble: 12 x 2
#> value dates
#> <chr> <date>
#> 1 a 1970-01-02
#> 2 a 1970-01-03
#> 3 a 1970-01-04
#> 4 a 1970-01-05
#> 5 a 1970-01-06
#> 6 a 1970-01-07
#> 7 b 1970-01-03
#> 8 b 1970-01-04
#> 9 b 1970-01-05
#> 10 b 1970-01-06
#> 11 c 1970-01-04
#> 12 c 1970-01-05
Created on 2021-05-18 by the reprex package (v1.0.0)
Upvotes: 3
Reputation: 388797
To extract the start and end date from interval you can use int_start
and int_end
, create a sequence of dates with map2
and unnest
.
library(dplyr)
library(purrr)
library(tidyr)
library(lubridate)
df %>%
mutate(date = map2(int_start(duration), int_end(duration),
~seq(as.Date(.x), as.Date(.y), by = 'day'))) %>%
#This will also work but would return date of class POSIXct
#mutate(date = map2(int_start(duration), int_end(duration),seq,by = 'day')) %>%
unnest(date) %>%
select(-duration)
# value start end date
# <chr> <date> <date> <date>
# 1 a 1970-01-02 1970-01-07 1970-01-02
# 2 a 1970-01-02 1970-01-07 1970-01-03
# 3 a 1970-01-02 1970-01-07 1970-01-04
# 4 a 1970-01-02 1970-01-07 1970-01-05
# 5 a 1970-01-02 1970-01-07 1970-01-06
# 6 a 1970-01-02 1970-01-07 1970-01-07
# 7 b 1970-01-03 1970-01-06 1970-01-03
# 8 b 1970-01-03 1970-01-06 1970-01-04
# 9 b 1970-01-03 1970-01-06 1970-01-05
#10 b 1970-01-03 1970-01-06 1970-01-06
#11 c 1970-01-04 1970-01-05 1970-01-04
#12 c 1970-01-04 1970-01-05 1970-01-05
Upvotes: 2
Reputation: 18425
I don't think interval
is helping here - seq.Date
might be better...
library(purrr) #as well as those you have
df <- data.frame(value = letters[1:3], start = as_date(1:3), end = as_date(3:1)+3) %>%
mutate(day = map2(start, end, seq.Date, by = "day")) %>%
unnest(day)
df
# A tibble: 12 x 4
value start end day
<chr> <date> <date> <date>
1 a 1970-01-02 1970-01-07 1970-01-02
2 a 1970-01-02 1970-01-07 1970-01-03
3 a 1970-01-02 1970-01-07 1970-01-04
4 a 1970-01-02 1970-01-07 1970-01-05
5 a 1970-01-02 1970-01-07 1970-01-06
6 a 1970-01-02 1970-01-07 1970-01-07
7 b 1970-01-03 1970-01-06 1970-01-03
8 b 1970-01-03 1970-01-06 1970-01-04
9 b 1970-01-03 1970-01-06 1970-01-05
10 b 1970-01-03 1970-01-06 1970-01-06
11 c 1970-01-04 1970-01-05 1970-01-04
12 c 1970-01-04 1970-01-05 1970-01-05
Upvotes: 2