Reputation: 5897
I simulated normally distributed random data for each day from 2014-01-01 to 2016-01-01, then tried to total that data over regular 8-day periods. Here is my code (with a seed added for reproducibility):
library(xts)
library(ggplot2)
set.seed(123)
#simulate data
property_damages_in_dollars <- rnorm(731,100,10)
date_decision_made = seq(as.Date("2014/1/1"), as.Date("2016/1/1"),by="day")
date_decision_made <- format(as.Date(date_decision_made), "%Y/%m/%d")
final_data <- data.frame(date_decision_made, property_damages_in_dollars)
#convert to xts object
dat <- xts(final_data$property_damages_in_dollars,
as.Date(final_data$date_decision_made, '%Y/%m/%d'))
#aggregate by 8 day period
ep <- endpoints(dat,'days',k=8)
#final aggregated file
a <- period.apply(dat, INDEX = ep, FUN = sum)
#plot
a_df <- fortify(a)
ggplot(a_df, aes(x = Index, y = a)) + geom_line()
However, the 8-day totals show irregular downward "spikes", which suggests that something is going wrong in the summation:
At the very end of the graph there is a "drop". This looks suspicious, but is somewhat understandable.
In the middle of the graph there is a very noticeable "sharp drop", which really looks like a calculation error. Since the data run from January 2014 to January 2016, this drop falls where the year changes from 2014 (December) to 2015, and the aggregated values for that period are also low.
Can anyone explain these "drops" and suggest how they can be fixed? Thanks
Upvotes: 1
Views: 56
Reputation: 7611
It looks like the endpoints function takes the end of the year into account, for some reason. Its result, though, is just a numeric vector of row positions: it starts at 0 and gives the index of the last observation in each period. So replacing it with something like
total_days <- 731
period_length <- 8
ep <- seq(0, total_days, period_length)
if (ep[length(ep)] < total_days) {
ep[length(ep) + 1] <- total_days
}
seems to work. I used the if because seq stops short when total_days isn't a multiple of period_length. There's probably a neater way: see this question for possible solutions, if you're interested.
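As one possible shorter form (a sketch, not necessarily what the linked question recommends), the if can be replaced by appending total_days and dropping a duplicate. The snippet below uses base R only and re-creates the question's simulated values; period_id, period_sums and days_per_period are names made up for this illustration.

```r
# Values taken from the answer above
total_days <- 731
period_length <- 8

# Row index of the last observation in each period, starting at 0.
# unique() drops the appended value when total_days is already a multiple.
ep <- unique(c(seq(0, total_days, by = period_length), total_days))

# Number of daily observations that fall into each period
days_per_period <- diff(ep)

# Same simulated data as in the question
set.seed(123)
property_damages_in_dollars <- rnorm(total_days, 100, 10)

# Assign each row to the period (ep[i], ep[i + 1]] and sum within periods
period_id <- findInterval(seq_len(total_days), ep, left.open = TRUE)
period_sums <- tapply(property_damages_in_dollars, period_id, sum)
```

The same ep vector can be passed to period.apply exactly as in the question; the tapply call is only there so the check runs without xts.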
This seems to fix the middle dip. The last one remains because the final period does not hold a full 8 days' worth of data: 731 = 91 * 8 + 3, so it covers only 3 days.
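To see whether a year-based grouping could produce dips of the shape in the question, here is a small base-R check. It assumes, purely as a hypothesis, that periods are formed from the day of the year and therefore restart every January 1; this is not taken from the xts source, and the period labels are invented for the illustration.

```r
# Same date range as in the question
dates <- seq(as.Date("2014-01-01"), as.Date("2016-01-01"), by = "day")
k <- 8

lt <- as.POSIXlt(dates)
year <- lt$year + 1900   # calendar year
yday <- lt$yday          # 0-based day of the year

# Hypothetical period label: blocks of k days that restart each year
period <- paste(year, yday %/% k)

# Days per period, kept in chronological order
days_per_period <- table(factor(period, levels = unique(period)))

# Periods that contain fewer than k days
short <- days_per_period[days_per_period < k]
short
```

Under that assumption the short periods are the last block of 2014 (5 days), the last block of 2015 (5 days) and the single day in 2016. With a daily mean of 100, that would give totals near 500 in the middle of the series and near 500 and 100 at the end, against roughly 800 elsewhere.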
Upvotes: 1