Reputation: 965
I am looking to compute a 4-day rolling average over a large data set. The problem is that some individuals have fewer than 4 cases, so I get an error indicating that k <= n is not TRUE.
Is there a way to remove any individual that does not have enough data from the data set?
Here is an example of how the data would look:
Name variable.1
1 Kim 64.703950
2 Kim 926.339849
3 Kim 128.662977
4 Kim 290.888594
5 Kim 869.418523
6 Bob 594.973849
7 Bob 408.159544
8 Bob 609.140928
9 Joseph 496.779712
10 Joseph 444.028668
11 Joseph -213.375635
12 Joseph -76.728981
13 Joseph 265.642784
14 Hank -91.646728
15 Hank 170.209746
16 Hank 97.889889
17 Hank 12.069074
18 Hank 402.361731
19 Earl 721.941796
20 Earl 4.823148
21 Earl 696.299627
Upvotes: 5
Views: 1251
Reputation: 927
You could create a second data.frame that is aggregated to the user level, with a row count for each user. Then join that data.frame onto the original one by user, and subset the result to the rows where count >= 4.
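The aggregate-join-subset steps described above could be sketched in base R roughly as follows. The sample data frame, the `count` column name, and the choice of `aggregate` and `merge` are assumptions made for illustration; the answer does not name specific functions.

```r
# Hypothetical sample mirroring the question's layout: Kim has 5 rows, Bob has 3.
df <- data.frame(
  Name = c(rep("Kim", 5), rep("Bob", 3)),
  variable.1 = c(64.70, 926.34, 128.66, 290.89, 869.42, 594.97, 408.16, 609.14),
  stringsAsFactors = FALSE
)

# Step 1: aggregate to one row per name, holding that name's row count.
counts <- aggregate(variable.1 ~ Name, data = df, FUN = length)
names(counts)[2] <- "count"  # assumed column name

# Step 2: join the counts back onto the original rows by Name.
merged <- merge(df, counts, by = "Name")

# Step 3: keep only the names with at least 4 rows.
result <- subset(merged, count >= 4)
```

Note that `merge` sorts by the key and does not promise to preserve the original row order, so if the order matters for the rolling average it is probably safer to carry an explicit index column and re-sort afterwards.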
Upvotes: 0
Reputation: 21641
Try:
library(zoo)
library(dplyr)
df %>%
  group_by(Name) %>%
  filter(n() >= 4) %>%
  # fill = NA replaces the deprecated na.pad = TRUE
  mutate(daymean = rollmean(variable.1, 4, align = "right", fill = NA))
This keeps only the groups with at least 4 rows and calculates a 4-day rolling average of variable.1 within each group:
# Name variable.1 daymean
#1 Kim 64.70395 NA
#2 Kim 926.33985 NA
#3 Kim 128.66298 NA
#4 Kim 290.88859 352.6488
#5 Kim 869.41852 553.8275
#6 Joseph 496.77971 NA
#7 Joseph 444.02867 NA
#8 Joseph -213.37563 NA
#9 Joseph -76.72898 162.6759
#10 Joseph 265.64278 104.8917
#11 Hank -91.64673 NA
#12 Hank 170.20975 NA
#13 Hank 97.88989 NA
#14 Hank 12.06907 47.1305
#15 Hank 402.36173 170.6326
Upvotes: 2
Reputation: 52687
Here are two options in base R. The first uses ave to produce, for each row, the size of that row's group (ave recycles its result to fill the group):
subset(DF, ave(seq_along(Name), Name, FUN = length) >= 4)
The second uses table to count the items in each group and %in% to keep only the rows that belong to groups with enough items:
subset(DF, Name %in% names(table(Name)[table(Name) >= 4]))
Both produce:
Name variable.1
1 Kim 64.70395
2 Kim 926.33985
3 Kim 128.66298
4 Kim 290.88859
5 Kim 869.41852
9 Joseph 496.77971
10 Joseph 444.02867
11 Joseph -213.37563
12 Joseph -76.72898
13 Joseph 265.64278
14 Hank -91.64673
15 Hank 170.20975
16 Hank 97.88989
17 Hank 12.06907
18 Hank 402.36173
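To tie this back to the rolling average in the question, here is a minimal sketch that stays in base R after the filtering step. The helper name `roll4`, the `daymean` column, and the use of `stats::filter` for the trailing mean are assumptions for illustration, not part of the original answer.

```r
# Hypothetical sample: Kim has 5 rows, Bob has only 3.
DF <- data.frame(
  Name = c(rep("Kim", 5), rep("Bob", 3)),
  variable.1 = c(64.703950, 926.339849, 128.662977, 290.888594, 869.418523,
                 594.973849, 408.159544, 609.140928),
  stringsAsFactors = FALSE
)

# Keep only the groups with at least 4 rows.
kept <- subset(DF, ave(seq_along(Name), Name, FUN = length) >= 4)

# Trailing 4-row mean; the first 3 positions of each group come out as NA.
roll4 <- function(x) as.numeric(stats::filter(x, rep(1 / 4, 4), sides = 1))

# Apply the rolling mean separately within each remaining group.
kept$daymean <- ave(kept$variable.1, kept$Name, FUN = roll4)
```

Filtering first matters here because `stats::filter` also stops with an error when the window is longer than the series, much like the k <= n check in the question.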
Upvotes: 0
Reputation: 9143
If your data frame is df, you can remove all names that occur fewer than 4 times with dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  filter(n() >= 4)
Upvotes: 4