Reputation: 51
I have a medium-sized data set; here is an example taken from it:
2011.2012
9/7
11/5
12/15
1/5
2/5
I'd like to convert this data into a time series format.
After converting the values from factors to characters, I used the as.Date function, but I ran into a problem.
The results assume the missing year is the current year. My goal is to convert the dates before 1/1 into year 2011, and those after 1/1 into year 2012. The data range from September 2011 to April 2012.
I've tried using origin and start, but to no avail. Here is my code:
date1 <- as.character(2011.2012)
date1 <- as.Date(date1, format="%m/%d")
Upvotes: 1
Views: 2539
Reputation: 263342
I'm posting what can be called a vectorized approach, unlike those offered so far. I think ifelse is pseudo-vectorization, since three vectors need to be constructed and then selected from.
dat <- read.table(text="2011.2012
9/7
11/5
12/15
1/5
2/5", header=TRUE)
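dat$date1 <- as.Date(dat$X2011.2012, format="%m/%d")
# running count of year rollovers: a date earlier than its predecessor starts a new year
dat$GT <- cumsum( c(FALSE, diff(dat$date1) < 0) )
# start year taken from the column name "X2011.2012"
startyr <- as.numeric( substr(names(dat)[1], 2,5) )
dat$truedate <- paste( format(dat$date1, format="%m/%d") ,
                       dat$GT+startyr, sep="-")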
dat
#-------------------------
  X2011.2012      date1 GT   truedate
1        9/7 2012-09-07  0 09/07-2011
2       11/5 2012-11-05  0 11/05-2011
3      12/15 2012-12-15  0 12/15-2011
4        1/5 2012-01-05  1 01/05-2012
5        2/5 2012-02-05  1 02/05-2012
I think reading the first line as a header is perfectly legitimate, but if the start year has to be supplied some other way, an alternative that uses the cumsum of the diff.Date operation should still be 'vectorized'. This would not be restricted to only two years and should be successful if there were even one date per year.
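As a rough illustration of the multi-year case, here is a minimal sketch that supplies the start year directly instead of reading it from a header. The vector md and the variable start_year are made up for this example, and parsing against the fixed leap year 2000 is my own assumption, added so the result does not depend on the year in which the code is run.

```r
# Hypothetical inputs: month/day strings in chronological order and the
# year of the first entry (neither comes from the original question).
md <- c("9/7", "12/15", "1/5", "6/20", "11/2", "2/14")
start_year <- 2011

# Parse against a fixed leap year so this step does not depend on the
# current year and "2/29" would still parse here.
ref <- as.Date(paste0("2000/", md), format = "%Y/%m/%d")

# Each backward step in month/day marks a new year; cumsum counts them.
rollover <- cumsum(c(FALSE, diff(ref) < 0))

# Rebuild real dates from the corrected year plus the month and day.
true_date <- as.Date(paste(start_year + rollover, format(ref, "%m-%d"), sep = "-"))
true_date
```

The result is of class Date rather than character, so it can be passed straight to time series or plotting functions.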
Upvotes: 0
Reputation: 13103
Here is what I came up with. I am not sure this code will always work, but it seems to work with the example data set I used. The code seems to handle more than two years and any day of the year.
The code cannot handle a year for which there are no data, but if a year is not in the data set then such a gap probably could not be identified regardless.
Note also that this approach will fail with the following two dates: "1/30" and "3/1", if the two dates are from two consecutive years. The gap between them is so long that there is no way for the computer to realize the two dates do not come from the same year.
In other words, if there are very long gaps between two consecutive dates, any approach is likely to fail without additional information. If there is, for example, at least one date from every quarter or half year, then I think both posted answers will work, because the computer will be able to identify a decrease in consecutive months as indicating a new year.
Maybe both approaches will work if the longest gap between two consecutive dates is 11 months. Maybe a gap of 363 days would be okay if the code were modified to also check the day of the month for each of two consecutive dates.
# specify the initial year and create dates from the data
initial.year = 2010
date <- c("12/30", "1/1", "6/1", "6/1", "10/25", "11/27", "12/28",
"1/16", "2/17", "2/17", "2/17")
DDD3 <- as.Date(date, format="%m/%d")
# deconstruct dates into month, day and erroneous year
dtstr <- as.character(DDD3)
month <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 2)))
day <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 3)))
year <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 1)))
DDD4 <- data.frame(month, day, year)
# obtain correct year for each date
year2=rep(NA, nrow(DDD4))
year2[1] = initial.year
for(i in 2:length(year2)) {
  if(DDD4[i,1] <  DDD4[(i-1),1]) (year2[i] = year2[(i-1)]+1)
  if(DDD4[i,1] >= DDD4[(i-1),1]) (year2[i] = year2[(i-1)])
}
# create new dates using correct year
day2 <- sprintf("%02d", day)
month2 <- sprintf("%02d", month)
year2 <- as.character(year2)
DDD5 <- data.frame(month2, day2, year2)
DDD6 <- paste(DDD5[,1], DDD5[,2], DDD5[,3], sep='/')
DDD7 <- as.Date(DDD6, "%m/%d/%Y")
DDD7
# [1] "2010-12-30" "2011-01-01" "2011-06-01" "2011-06-01"
# "2011-10-25" "2011-11-27" "2011-12-28" "2012-01-16"
# "2012-02-17" "2012-02-17" "2012-02-17"
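The day-of-month check suggested above could look roughly like the sketch below. The helper assign_years and its use_day argument are hypothetical names of my own, and the function assumes the input is already in chronological order. It compares a combined month-and-day key instead of the month alone, and it also shows the case that stays undetectable.

```r
# Hypothetical helper (not from the answers here): assign years to
# chronologically ordered "m/d" strings, starting from `initial_year`.
assign_years <- function(md, initial_year, use_day = TRUE) {
  parts <- do.call(rbind, strsplit(md, "/", fixed = TRUE))
  month <- as.integer(parts[, 1])
  day   <- as.integer(parts[, 2])
  # Comparison key: month only, or month and day combined (1/16 -> 116).
  key <- if (use_day) month * 100L + day else month
  # A drop in the key between consecutive entries is read as a new year.
  year <- initial_year + cumsum(c(FALSE, diff(key) < 0))
  as.Date(sprintf("%d-%02d-%02d", as.integer(year), month, day))
}

# Almost a full year apart: caught only when the day is compared too.
assign_years(c("1/16", "1/15"), 2011, use_day = FALSE)
assign_years(c("1/16", "1/15"), 2011, use_day = TRUE)

# "1/30" followed by "3/1": nothing decreases, so no new year is detected
# even if the second date really belongs to the following year.
assign_years(c("1/30", "3/1"), 2011)
```

With use_day = TRUE the second date in the first pair is placed in 2012. The last call keeps both dates in 2011, which is the limitation described above.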
Upvotes: 3
Reputation: 42649
How about splitting on June/July? This depends on the format of your dates.
x <- c("9/7", "11/5", "12/15", "1/5", "2/5")
sapply(strsplit(x, '/'),
       function(x) paste(if(as.numeric(x[1]) > 6) 2011 else 2012, x[1], x[2], sep="/"))
[1] "2011/9/7" "2011/11/5" "2011/12/15" "2012/1/5" "2012/2/5"
Here is a vectorized approach to the above, which uses ifelse rather than if:
mm <- matrix(nrow=2, unlist(strsplit(x, '/')))
paste(ifelse(as.numeric(mm[1,]) > 6, 2011, 2012), mm[1,], mm[2,], sep='/')
[1] "2011/9/7" "2011/11/5" "2011/12/15" "2012/1/5" "2012/2/5"
The vectorized approach is less readable, but is much faster (1.7x).
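To compare the two versions on your own machine, a minimal timing sketch could look like this. The wrapper names by_element and vectorized and the enlarged input big are made up for this example; the ratio you see will depend on the machine, the R version, and the input size.

```r
x <- c("9/7", "11/5", "12/15", "1/5", "2/5")
big <- rep(x, 20000)   # hypothetical larger input so the timings are measurable

# Element-by-element version: if/else inside sapply.
by_element <- function(v) {
  sapply(strsplit(v, "/"),
         function(p) paste(if (as.numeric(p[1]) > 6) 2011 else 2012,
                           p[1], p[2], sep = "/"))
}

# Vectorized version: one ifelse over all months at once.
vectorized <- function(v) {
  mm <- matrix(unlist(strsplit(v, "/")), nrow = 2)
  paste(ifelse(as.numeric(mm[1, ]) > 6, 2011, 2012), mm[1, ], mm[2, ], sep = "/")
}

# Check that both versions agree before comparing their timings.
stopifnot(identical(by_element(big), vectorized(big)))

system.time(by_element(big))
system.time(vectorized(big))
```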
Here is one way you could vectorize @MarkMiller's approach, using the Date functions and looking for a regression in the month:
initialYear <- 2011
dd <- as.Date(x, "%m/%d")
mon <- format(dd, "%m")
as.Date(paste(initialYear + c(0, cumsum(diff(as.numeric(mon))<0))
, mon
, format(dd, "%d")
, sep="-"
)
)
[1] "2011-09-07" "2011-11-05" "2011-12-15" "2012-01-05" "2012-02-05"
Probably because of all the Date functions, this takes 3.6x longer to run than the vectorized approach above (2.6x if the final as.Date is removed), which is limited to 2011 and 2012. I did not measure Mark's code, but this is probably faster than its three sapply calls and explicit for loop.
Upvotes: 6