Reputation: 87

Is there a way to create dummy variables for years that fall between two time points?

I am working with some time series data, where each row is an observation of a person, and I have two time periods, the start date and the end date. I am trying to create dummy variables for each year, such that if the year falls between the start date and the end date, the dummy is coded as 1.

The end result is to use this for data visualization purposes on demographics by year.

I've looked at some packages, but it seems to create dummies from variables already provided. Since some of the years may be missing from one of the columns, I'm trying to find an alternate option.

id <- c(1:3)
start.date <- c(1990, 1850, 1910)
end.date <- c(2014, 1920, 1980)

df <- data.frame(id, start.date, end.date)

df

As you can see from the structure of the data, I would like individual 1, for instance, to have the dummies coded between 1990 and 2014 as 1, and 0 otherwise.

Upvotes: 3

Answers (3)

Cole

Reputation: 11255

Here's a base way to do it:

seqs <- df$end.date - df$start.date+1

table(data.frame(id = rep(df[['id']], seqs)
           , Y = paste0('Y', rep(df[['start.date']], seqs) + sequence(seqs) - 1)))

   Y
id  Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858 Y1859 ... lots of columns
  1     0     0     0     0     0     0     0     0     0     0
  2     1     1     1     1     1     1     1     1     1     1
  3     0     0     0     0     0     0     0     0     0     0

Upvotes: 0

M--

Reputation: 28955

Another method of doing this using tidyr::expand instead of dplyr::transmute.

df1 <- data.frame(id = c(1:3),
                  start.date = c(1990, 1850, 1910),
                  end.date = c(2014, 1920, 1980))

library(dplyr)
library(tidyr)

df1 %>%
  group_by(id) %>% 
  expand(year = start.date:end.date) %>% 
  mutate(value = 1, year = paste0("Y",year)) %>% 
  ungroup %>% 
  spread(year, value, fill = 0)

#> # A tibble: 3 x 157
#>      id Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858 Y1859 Y1860
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     0     0     0     0     0     0     0     0     0     0     0
#> 2     2     1     1     1     1     1     1     1     1     1     1     1
#> 3     3     0     0     0     0     0     0     0     0     0     0     0
#> # ... with 145 more variables: Y1861 <dbl>, Y1862 <dbl>, Y1863 <dbl>,...

Upvotes: 2

Shree

Reputation: 11140

If I understand correctly, you want a dataframe with all years for every id -

library(dplyr)
library(tidyr)

df %>% 
  group_by(id) %>% 
  transmute(years = list(paste0("Y", start.date:end.date)), value = 1) %>% 
  unnest() %>% 
  ungroup() %>% 
  spread(years, value, fill = 0)

# showing first 10 of total 157 columns
# A tibble: 3 x 10
     id Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     0     0     0     0     0     0     0     0     0
2     2     1     1     1     1     1     1     1     1     1
3     3     0     0     0     0     0     0     0     0     0

Upvotes: 1

Is there a way to create dummy variables for years that fall between two time points?

Answers (3)

Related Questions