Reputation: 83
Here is an example of my dataframe. I am working in R.
date name count
2016-11-12 Joe 5
2016-11-15 Bob 5
2016-06-15 Nick 12
2016-10-16 Cate 6
I would like to add a column to my data frame that will tell me the season that corresponds to the date. I would like it to look like this:
date name count Season
2016-11-12 Joe 5 Winter
2016-11-15 Bob 5 Winter
2017-06-15 Nick 12 Summer
2017-10-16 Cate 6 Fall
I have started some code:
startWinter <- c(month.name[1], month.name[12], month.name[11])
startSummer <- c(month.name[5], month.name[6], month.name[7])
startSpring <- c(month.name[2], month.name[3], month.name[4])
# create a function to find the correct season based on the month
MonthSeason <- function(Month) {
# !is.na()
# ignores values with NA
# match()
# returns a vector of the positions of matches
# If the starting month matches a spring season, print "Spring". If the starting month matches a summer season, print "Summer" etc.
ifelse(!is.na(match(Month, startSpring)),
return("spring"),
return(ifelse(!is.na(match(Month, startWinter)),
"winter",
ifelse(!is.na(match(Month, startSummer)),
"summer","fall"))))
}
This code gives me the season for a month. Im not sure if I am going about this problem in the right way. Can anyone help me out? Thanks!
Upvotes: 2
Views: 3435
Reputation: 160447
There are a couple of hacks, and their usability depends on whether you want to use meteorological or astronomical seasons. I'll offer both, I think they offer sufficient flexibility.
I'm going to use your second data provided, since it provides more than just "Winter".
txt <- "date name count
2016-11-12 Joe 5
2016-11-15 Bob 5
2017-06-15 Nick 12
2017-10-16 Cate 6"
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)
The quickest method works well when seasons are defined strictly by month.
metseasons <- c(
"01" = "Winter", "02" = "Winter",
"03" = "Spring", "04" = "Spring", "05" = "Spring",
"06" = "Summer", "07" = "Summer", "08" = "Summer",
"09" = "Fall", "10" = "Fall", "11" = "Fall",
"12" = "Winter"
)
metseasons[format(dat$date, "%m")]
# 11 11 06 10
# "Fall" "Fall" "Summer" "Fall"
If you choose to use date ranges for your seasons that are not defined by month start/stop such as the astronomical seasons, here's another 'hack':
astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))
astroseasons_labels <- c("Winter", "Spring", "Summer", "Fall", "Winter")
If you use proper Date
or POSIX
types, then you are including years, which makes things a little less-generic. One might think of using julian dates, but during leap years this produces anomalies. So, with the assumption that Feb 28 is never a seasonal boundary, I'm "numericizing" the month-day. Even though R does do character-comparisons just fine, cut
expects numbers, so we convert them to integers.
Two safe-guards: because cut
is either right-open (and left-closed) or right-closed (and left-open), then our two book-ends need to extend beyond the legal dates, ergo "0000"
and "1232"
. There are other techniques that could work equally well here (e.g., using -Inf
and Inf
, post-integerization).
astroseasons_labels[ cut(as.integer(format(dat$date, "%m%d")), astroseasons, labels = FALSE) ]
# [1] "Fall" "Fall" "Spring" "Fall"
Notice that the third date is in Spring when using astronomical seasons and Summer otherwise.
This solution can easily be adjusted to account for the Southern hemisphere or other seasonal preferences/beliefs.
Edit: motivated by @Kristofersen's answer (thanks), I looked into benchmarks. lubridate::month
uses a POSIXct
-to-POSIXlt
conversion to extract the month, which can be over 10x faster than my format(x, "%m")
method. As such:
metseasons2 <- c(
"Winter", "Winter",
"Spring", "Spring", "Spring",
"Summer", "Summer", "Summer",
"Fall", "Fall", "Fall",
"Winter"
)
Noting that as.POSIXlt
returns 0-based months, we add 1:
metseasons2[ 1 + as.POSIXlt(dat$date)$mon ]
# [1] "Fall" "Fall" "Summer" "Fall"
Comparison:
library(lubridate)
library(microbenchmark)
set.seed(42)
x <- Sys.Date() + sample(1e3)
xlt <- as.POSIXlt(x)
microbenchmark(
metfmt = metseasons[ format(x, "%m") ],
metlt = metseasons2[ 1 + xlt$mon ],
astrofmt = astroseasons_labels[ cut(as.integer(format(x, "%m%d")), astroseasons, labels = FALSE) ],
astrolt = astroseasons_labels[ cut(100*(1+xlt$mon) + xlt$mday, astroseasons, labels = FALSE) ],
lubridate = sapply(month(x), seasons)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# metfmt 1952.091 2135.157 2289.63943 2212.1025 2308.1945 3748.832 100
# metlt 14.223 16.411 22.51550 20.0575 24.7980 68.924 100
# astrofmt 2240.547 2454.245 2622.73109 2507.8520 2674.5080 3923.874 100
# astrolt 42.303 54.702 72.98619 66.1885 89.7095 163.373 100
# lubridate 5906.963 6473.298 7018.11535 6783.2700 7508.0565 11474.050 100
So the methods using as.POSIXlt(...)$mon
are significantly faster. (@Kristofersen's answer could be improved by vectorizing it, perhaps with ifelse
, but that still won't compare to the speed of the vector lookups with or without cut
.)
Upvotes: 3
Reputation: 2806
You can do this pretty quickly with lubridate and a function to change the month number into a season.
library(lubridate)
seasons = function(x){
if(x %in% 2:4) return("Spring")
if(x %in% 5:7) return("Summer")
if(x %in% 8:10) return("Fall")
if(x %in% c(11,12,1)) return("Winter")
}
dat$Season = sapply(month(dat$date), seasons)
> dat
date name count Season
1 2016-11-12 Joe 5 Winter
2 2016-11-15 Bob 5 Winter
3 2016-06-15 Nick 12 Summer
4 2016-10-16 Cate 6 Fall
Upvotes: 3
Reputation: 692
if your data is df:
# create dataframe for month and corresponding season
dfSeason <- data.frame(season = c(rep("Winter", 3), rep("Summer", 3),
rep("Spring", 3), rep("Fall", 3)),
month = month.name[c(11,12,1, 5:7, 2:4, 8:10)],
stringsAsFactors = F)
# make date as date
df$data <- as.Date(df$date)
# match the month of the date in df (format %B) with month in season
# then use it to index the season of dfSeason
df$season <- dfSeason$season[match(format(df$data, "%B"), dfSeason$month)]
Upvotes: 1