Dan
Dan

Reputation: 1778

How to Create ID Variable based on Rolling Date Rule?

This is embarrassing: I admit several pieces of code of mine have a similar structure to this:

enter image description here

Arghhhh! Do real programmers cringe when they see something like this?

The figure should be self-explanatory: I need a variable called Season that returns an ID "(year-1)_(year)" based on the WEATHERDATE column.

So, any row with dates from '1998-06-15' to '1999-06-14' should return "1998-1999" under the Season ID Column.

The WEATHERDATE column simply runs from 1998-01-01 to TODAY.

I bet there's a one/two liner solution to this. I've tried the zoo package but no success.

Any R wizard to point me to the obvious solution I'm missing here?

***REPRODUCIBLE EXAMPLE:

WEATHERDATE <- seq(as.Date("1998-01-01"), len=99999, by=1)
VARIABLE    <- rnorm(n = length(WEATHERDATE))
data_mex    <- data.frame(WEATHERDATE, VARIABLE)
## how to create SEASON based on dates??
# I would then run the code block from above, something like:
data_mex <- within(data_mex, Season[DATEWEATHER >= ymd(StartOfSeason)+365*0   & DATEWEATHER < ymd(StartOfSeason)+365*1 ] <- "1998-1999") 

Of course a solution where I don't have to come back and touch the code every year would be ideal :)

Upvotes: 1

Views: 181

Answers (2)

ajrwhite
ajrwhite

Reputation: 8458

You can use cut.Date and seq.Date to do this in two lines in base R:

seasons <- format(as.Date(cut.Date(as.Date(data_mex$WEATHERDATE),
               breaks=seq.Date(as.Date("1997-06-15"),
                 as.Date("2280-06-15", "year"))), "%Y")
data_mex$seasons <- paste0(seasons, "-", as.numeric(seasons) + 1)

Note I have used "2280-06-15" as the latest date for your seasons, but you may want to replace that with Sys.Date(), or whenever is appropriate for your task.

This returns:

> head(data_mex)
  WEATHERDATE   VARIABLE   seasons
1  1998-01-01 -0.2260734 1997-1998
2  1998-01-02  0.3222805 1997-1998
3  1998-01-03 -0.1554167 1997-1998
4  1998-01-04 -0.5591154 1997-1998
5  1998-01-05  1.0729737 1997-1998
6  1998-01-06  1.0030025 1997-1998

> tail(data_mex)
      WEATHERDATE   VARIABLE   seasons
99994  2271-10-10 0.59986466 2271-2272
99995  2271-10-11 0.37304603 2271-2272
99996  2271-10-12 1.30822156 2271-2272
99997  2271-10-13 0.01204986 2271-2272
99998  2271-10-14 0.87340544 2271-2272
99999  2271-10-15 0.44098083 2271-2272

Upvotes: 6

Gopala
Gopala

Reputation: 10483

Here is one solution you can use to get the desired output:

data_mex$seasonId <-
  with(data_mex,
       ifelse(as.numeric(format(WEATHERDATE, '%m')) >= 6 &
                as.numeric(format(WEATHERDATE, '%d')) >= 15,
              paste(format(WEATHERDATE, '%Y'),
                    as.numeric(format(WEATHERDATE, '%Y')) + 1, sep = '-'),
              paste(as.numeric(format(WEATHERDATE, '%Y')) - 1,
                    format(WEATHERDATE, '%Y'), sep = '-')))

Resulting output is:

head(data_mex)
  WEATHERDATE    VARIABLE  seasonId
1  1998-01-01 -0.02541724 1997-1998
2  1998-01-02 -0.48725913 1997-1998
3  1998-01-03  0.06204881 1997-1998
4  1998-01-04 -0.15207281 1997-1998
5  1998-01-05 -0.61809248 1997-1998
6  1998-01-06 -1.20427734 1997-1998

tail(data_mex)
      WEATHERDATE   VARIABLE  seasonId
99994  2271-10-10 -0.8976144 2271-2272
99995  2271-10-11 -0.5684972 2271-2272
99996  2271-10-12 -0.8847031 2271-2272
99997  2271-10-13 -0.2659979 2271-2272
99998  2271-10-14 -0.9108352 2271-2272
99999  2271-10-15  1.1601581 2271-2272

You extract the year portion each time, and paste the lagging or leading year based on the month and day value. format method is allowing you extract specific portions of the date.

Upvotes: 1

Related Questions