Reputation: 21
I’ve to transform my dataframe from the current to the new format (see image or structure below). I’ve no idea how I can accomplish that. I want a year for each ID, from 2013-2018 (so each ID has 6 rows, one for every year). The dates are the dates of living on that adress (entry date) and when they left that adress (end date). So each ID and year gives the zipcode and city they lived. The place the ID lived (for each year) should be were they lived the longest that year. I've already set the enddate to 31-12-2018 if they still live there (here showed with NA). Below a picture and the first 3 rows. Hopefully you guys can help me out!
Current format:
New format:
CITY (NA, NEWYORK, NEWYORK, NEWYORK, LA, LA, MIAMI)
Upvotes: 2
Views: 110
Reputation: 30494
Here is one approach.
First, create date intervals for each location from start to end dates. Using map2
and unnest
you will create additional rows for each year.
Since you wish to include the location information where there were the greatest number of days for that calendar year, you could look at overlaps between 2 intervals: one interval is the calendar year, and the second interval is the ENTRY_DATE
to END_DATE
. For each year, you can filter
by max(WEEKS)
(or to ensure a single address per year, arrange in descending order by WEEKS and slice(1)
--- or with latest tidyr
consider slice_max
). This will keep the row where there is the greatest number of weeks duration overlap between intervals.
The final complete
will ensure you have rows for all years between 2013-2018.
library(tidyverse)
library(lubridate)
df %>%
mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
unnest(YEAR) %>%
mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
group_by(ID, YEAR) %>%
arrange(desc(WEEKS)) %>%
slice(1) %>%
group_by(ID) %>%
complete(YEAR = seq(2013, 2018, 1)) %>%
arrange(ID, YEAR) %>%
select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))
Output
# A tibble: 14 x 4
# Groups: ID [2]
ID YEAR ZIPCODE CITY
<dbl> <dbl> <chr> <chr>
1 1 2013 NA NA
2 1 2014 1234AB NEWYORK
3 1 2015 1234AB NEWYORK
4 1 2016 1234AB NEWYORK
5 1 2017 5678CD LA
6 1 2018 5678CD LA
7 2 2011 9012EF MIAMI
8 2 2012 9012EF MIAMI
9 2 2013 9012EF MIAMI
10 2 2014 9012EF MIAMI
11 2 2015 9012EF MIAMI
12 2 2016 9012EF MIAMI
13 2 2017 9012EF MIAMI
14 2 2018 NA NA
Data
df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD",
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072,
17238, 15288), class = "Date"), END_DATE = structure(c(17299,
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 2