Reputation: 381
I have a data file which consists of daily xy locations and a logical vector denoting whether or not the location is an outlier. Here is some (poorly created, I know) sample data:
x=seq(3,10,length.out=30)
y=seq(42,45,length.out=30)
outlier=c(F,F,F,F,F,F,F,F,T,T,T,F,F,F,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F)
data=cbind(x,y,outlier)
> data
x y outlier
[1,] 3.000000000 42.00000000 0
[2,] 3.241379310 42.10344828 0
[3,] 3.482758621 42.20689655 0
[4,] 3.724137931 42.31034483 0
[5,] 3.965517241 42.41379310 0
[6,] 4.206896552 42.51724138 0
[7,] 4.448275862 42.62068966 0
[8,] 4.689655172 42.72413793 0
[9,] 4.931034483 42.82758621 1
[10,] 5.172413793 42.93103448 1
[11,] 5.413793103 43.03448276 1
[12,] 5.655172414 43.13793103 0
[13,] 5.896551724 43.24137931 0
[14,] 6.137931034 43.34482759 0
[15,] 6.379310345 43.44827586 0
[16,] 6.620689655 43.55172414 0
[17,] 6.862068966 43.65517241 0
[18,] 7.103448276 43.75862069 0
[19,] 7.344827586 43.86206897 0
[20,] 7.586206897 43.96551724 0
[21,] 7.827586207 44.06896552 0
[22,] 8.068965517 44.17241379 1
[23,] 8.310344828 44.27586207 0
[24,] 8.551724138 44.37931034 1
[25,] 8.793103448 44.48275862 0
[26,] 9.034482759 44.58620690 0
[27,] 9.275862069 44.68965517 0
[28,] 9.517241379 44.79310345 0
[29,] 9.758620690 44.89655172 0
[30,] 10.000000000 45.00000000 0
What I need is to take a non-overlapping 6-day mean of the x and y columns. This is easy enough with rollapply()
. However, I do not want outlier=1
values to be included in the 6-day mean; nor do I want the 6-day window to 'span' the gap left behind by removing all rows where outlier=T
. Instead, I want to make an exception to the 'non-overlapping rule'.
I think this is best explained using the sample data above: the first value should be the mean of rows 1:6, but rather than the second value being the mean of rows 7:12 (including outlier=1
values) or of rows c(7:8,12:15) (skipping over outlier=1
values) I want it to overlap with the first window and take the mean of rows 3:8.
So for the length 30 sample data above, the end result should be of length 5, showing the mean values of rows 1:6, 3:8, 12:17, 16:21 & 25:30 (ideally all values which result from overlapping windows should be labelled as such; i.e. values 1:4 overlap, whereas the final value is unique)
Upvotes: 1
Views: 663
Reputation: 15163
Here is a function that will give you the indices of the endpoints of the averages that you want:
findIndices<-function(outlier,window=6){
r<-rle(outlier)
rends<-cumsum(r$lengths)
segs<-cbind(rends-r$lengths+1,rends)
segs<-segs[with(r,lengths>=window & values==0),]
indices<-unlist(apply(segs,1,function(x) seq(x[1]+window-1,x[2],by=window)))
sort(unique(c(indices,segs[,2])))
}
findIndices(data[,3])
## [1] 6 8 17 21 30
You can then get the averages you want like this:
id<-findIndices(data[,3])
require(zoo)
cbind(index=id,rollmean(data[,1:2],6)[id-5,])
## index x y
## [1,] 6 3.603448 42.25862
## [2,] 8 4.086207 42.46552
## [3,] 17 6.258621 43.39655
## [4,] 21 7.224138 43.81034
## [5,] 30 9.396552 44.74138
You can put it all together in a single function like this:
maWithOutliers<-function(x,outlier,window){
id<-findIndices(outlier,window)
cbind(index=id,rollmean(x,window)[id-window+1,])
}
> maWithOutliers(data[,1:2],data[,3],6)
index x y
[1,] 6 3.603448 42.25862
[2,] 8 4.086207 42.46552
[3,] 17 6.258621 43.39655
[4,] 21 7.224138 43.81034
[5,] 30 9.396552 44.74138
> maWithOutliers(data[,1:2],data[,3],4)
index x y
[1,] 4 3.362069 42.15517
[2,] 8 4.327586 42.56897
[3,] 15 6.017241 43.29310
[4,] 19 6.982759 43.70690
[5,] 21 7.465517 43.91379
[6,] 28 9.155172 44.63793
[7,] 30 9.637931 44.84483
>
Upvotes: 2