stats_noob

Reputation: 5897

R: errors when taking sum totals?

I simulated normally distributed random data for each day from January 1, 2014 to January 1, 2016, then tried to total this data over regular 8-day periods. Here is my code (I added a seed for reproducibility):

library(xts)
library(ggplot2)

set.seed(123)

# simulate data: one value per day, 731 days in total
property_damages_in_dollars <- rnorm(731, 100, 10)

date_decision_made <- seq(as.Date("2014/1/1"), as.Date("2016/1/1"), by = "day")

# store the dates as "YYYY/MM/DD" strings
date_decision_made <- format(as.Date(date_decision_made), "%Y/%m/%d")

final_data <- data.frame(date_decision_made, property_damages_in_dollars)

# convert to xts object
dat <- xts(final_data$property_damages_in_dollars,
           as.Date(final_data$date_decision_made, "%Y/%m/%d"))

# aggregate by 8 day period
ep <- endpoints(dat, "days", k = 8)

# final aggregated file
a <- period.apply(x = dat, ep, FUN = sum)

# plot
a_df <- fortify(a)
ggplot(a_df, aes(x = Index, y = a)) + geom_line()

However, there appear to be significant, irregular "spikes" in the 8-day totals, suggesting that something went wrong when summing the data:

[Image: line plot of the 8-day totals over time]

  1. Towards the end of the graph, there appears to be a "drop". This looks suspicious, but somewhat understandable.

  2. In the middle of the graph, there is a very noticeable "sharp drop", which really looks like a calculation error. This drop happens when the year transitions from 2014 (December) to 2015, and the corresponding aggregated numbers for that period are also low.

Can anyone explain these "drops" and suggest how they can be fixed? Thanks

Upvotes: 1

Views: 56

Answers (1)

Hobo

Reputation: 7611

Looks like the endpoints function takes the end of the year into account, for some reason, so the period that runs into December 31 gets cut short. But it also looks like it just returns a numeric vector of row positions, starting at 0, that mark the last observation in each period. So replacing it with something like

total_days <- 731
period_length <- 8

ep <- seq(0, total_days, period_length)
if (ep[length(ep)] < total_days) {
  ep[length(ep) + 1] <- total_days
}

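seems to work. I used the if because seq stops at the last multiple of period_length that doesn't exceed total_days, so the final partial period would be left out whenever total_days isn't a multiple of period_length. There's probably a neater way: see this question for possible solutions, if you're interested.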
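One possibly neater way to build the same vector, as a rough sketch: append total_days to the sequence and drop the duplicate that appears when it is already a multiple of the period length. The helper name make_endpoints is made up for illustration and is not part of xts.

```r
# Hypothetical helper (not part of xts): row-index endpoints for blocks of k rows.
make_endpoints <- function(n, k) {
  # seq() stops at the largest multiple of k that does not exceed n;
  # appending n closes a final partial block, and unique() drops the
  # duplicate when n is already a multiple of k.
  unique(c(seq(0, n, by = k), n))
}

ep <- make_endpoints(731, 8)
head(ep, 3)  # first few endpoints
tail(ep, 3)  # last few endpoints, ending at 731
```

The result can be passed to period.apply in place of the endpoints() output, just like the if version.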

This seems to fix the middle dip; the last one is because there's not a full 8 days' worth of data in that period (I think): 731 days is 91 full periods of 8 days plus 3 left over.
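If you want to check this yourself, diff() on the endpoint vector gives the number of daily rows in each period. The sketch below rebuilds the data on its own (slightly simplified from the question, so the simulated values are an assumption of the example) and compares the period sizes from the manual endpoints with those from endpoints(). What endpoints() reports may depend on your xts version, so no output is shown for it.

```r
library(xts)

set.seed(123)
dates <- seq(as.Date("2014-01-01"), as.Date("2016-01-01"), by = "day")
dat <- xts(rnorm(length(dates), 100, 10), order.by = dates)

# Manual endpoints: every 8th row, plus the final row to close the last period
ep <- unique(c(seq(0, nrow(dat), by = 8), nrow(dat)))

# Number of daily observations in each period
period_sizes <- diff(ep)
table(period_sizes)

# Same check for the endpoints() version; the result may vary by xts version
table(diff(endpoints(dat, "days", k = 8)))

# Totals using the manual endpoints
totals <- period.apply(dat, INDEX = ep, FUN = sum)
tail(totals, 3)
```

With the manual endpoints, every period has 8 rows except the last, which has 3, so only the final total should be low. Dropping that last row before plotting, or dividing each total by its period size, would be one way to avoid the dip at the end.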

Upvotes: 1
