Modified rollapply mean

Question

I have a data file which consists of daily xy locations and a logical vector denoting whether or not the location is an outlier. Here is some (poorly created, I know) sample data:

x=seq(3,10,length.out=30)
y=seq(42,45,length.out=30)
outlier=c(F,F,F,F,F,F,F,F,T,T,T,F,F,F,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F)
data=cbind(x,y,outlier)
> data
             x           y outlier
 [1,]  3.000000000 42.00000000       0
 [2,]  3.241379310 42.10344828       0
 [3,]  3.482758621 42.20689655       0
 [4,]  3.724137931 42.31034483       0
 [5,]  3.965517241 42.41379310       0
 [6,]  4.206896552 42.51724138       0
 [7,]  4.448275862 42.62068966       0
 [8,]  4.689655172 42.72413793       0
 [9,]  4.931034483 42.82758621       1
[10,]  5.172413793 42.93103448       1
[11,]  5.413793103 43.03448276       1
[12,]  5.655172414 43.13793103       0
[13,]  5.896551724 43.24137931       0
[14,]  6.137931034 43.34482759       0
[15,]  6.379310345 43.44827586       0
[16,]  6.620689655 43.55172414       0
[17,]  6.862068966 43.65517241       0
[18,]  7.103448276 43.75862069       0
[19,]  7.344827586 43.86206897       0
[20,]  7.586206897 43.96551724       0
[21,]  7.827586207 44.06896552       0
[22,]  8.068965517 44.17241379       1
[23,]  8.310344828 44.27586207       0
[24,]  8.551724138 44.37931034       1
[25,]  8.793103448 44.48275862       0
[26,]  9.034482759 44.58620690       0
[27,]  9.275862069 44.68965517       0
[28,]  9.517241379 44.79310345       0
[29,]  9.758620690 44.89655172       0
[30,] 10.000000000 45.00000000       0

What I need is to take a non-overlapping 6-day mean of the x and y columns. This is easy enough with rollapply(). However, I do not want outlier=1 values to be included in the 6-day mean; nor do I want the 6-day window to 'span' the gap left behind by removing all rows where outlier=T. Instead, I want to make an exception to the 'non-overlapping rule'.

I think this is best explained using the sample data above: the first value should be the mean of rows 1:6, but rather than the second value being the mean of rows 7:12 (including outlier=1 values) or of rows c(7:8,12:15) (skipping over outlier=1 values) I want it to overlap with the first window and take the mean of rows 3:8.

So for the length 30 sample data above, the end result should be of length 5, showing the mean values of rows 1:6, 3:8, 12:17, 16:21 & 25:30 (ideally all values which result from overlapping windows should be labelled as such; i.e. values 1:4 overlap, whereas the final value is unique)

mrip · Accepted Answer

Here is a function that will give you the indices of the endpoints of the averages that you want:

findIndices<-function(outlier,window=6){
  r<-rle(outlier)
  rends<-cumsum(r$lengths)
  segs<-cbind(rends-r$lengths+1,rends)
  segs<-segs[with(r,lengths>=window & values==0),]

  indices<-unlist(apply(segs,1,function(x) seq(x[1]+window-1,x[2],by=window)))
  sort(unique(c(indices,segs[,2])))     
}

findIndices(data[,3])
## [1]  6  8 17 21 30

You can then get the averages you want like this:

id<-findIndices(data[,3])
require(zoo)
cbind(index=id,rollmean(data[,1:2],6)[id-5,])
##     index        x        y
## [1,]     6 3.603448 42.25862
## [2,]     8 4.086207 42.46552
## [3,]    17 6.258621 43.39655
## [4,]    21 7.224138 43.81034
## [5,]    30 9.396552 44.74138

You can put it all together in a single function like this:

maWithOutliers<-function(x,outlier,window){
  id<-findIndices(outlier,window)
  cbind(index=id,rollmean(x,window)[id-window+1,])
}

> maWithOutliers(data[,1:2],data[,3],6)
     index        x        y
[1,]     6 3.603448 42.25862
[2,]     8 4.086207 42.46552
[3,]    17 6.258621 43.39655
[4,]    21 7.224138 43.81034
[5,]    30 9.396552 44.74138
> maWithOutliers(data[,1:2],data[,3],4)
     index        x        y
[1,]     4 3.362069 42.15517
[2,]     8 4.327586 42.56897
[3,]    15 6.017241 43.29310
[4,]    19 6.982759 43.70690
[5,]    21 7.465517 43.91379
[6,]    28 9.155172 44.63793
[7,]    30 9.637931 44.84483
>

Modified rollapply mean

Answers (1)

Related Questions