Reputation: 73
I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
...
What I need is to create a year-based variable, where I expand the time series by each year (starting from first of January (i.e. 2011-01-01) starts a new row for that particular "place" and "value". I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
...
There are some cases with overlap (ie. "place"=1 & "year"=2007) for two separate cases, where one observations starts with one year and the other observation continues from that year. In that case I would prefer the "value" that ends on that specific year. So if one observation for place=1 ends with 2007 in March and another place=1 starts with 2007 in April, year=2007 value for place=1 would be marked with the previous "ending" value if that makes sense.
I've only gotten this far:
library(data.table)
data <- data.table(dat)
data[,:=
(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[,num_mons:= length(seq(from=start_date, to=end_date, by='year')),by=1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.
Upvotes: 0
Views: 1135
Reputation: 1982
Using a tidyverse solution could look like:
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
data <- tibble(place = c(1, 2),
start_date = c('2007-09-01',
'2013-09-27'),
end_date = c('2010-10-12',
'2015-10-11'),
value = c(0.5, 0.7))
data %>%
mutate(year = map2(start_date,
end_date,
~ as.character(str_extract(.x, '\\d{4}'):
str_extract(.y, '\\d{4}')))) %>%
separate_rows(year) %>%
filter(!year %in% c('c', '')) %>%
select(place, year, value)
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.
Upvotes: 1
Reputation: 79338
You could do the following:
library(lubridate)
library(tidyverse)
df %>%
group_by(place) %>%
mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date)))))%>%
unnest(year)%>%
select(place,year,value)
# A tibble: 7 x 3
# Groups: place [2]
place year value
<int> <int> <dbl>
1 1 2007 0.5
2 1 2008 0.5
3 1 2009 0.5
4 1 2010 0.5
5 2 2013 0.7
6 2 2014 0.7
7 2 2015 0.7
Upvotes: 1