Reputation: 965

Removing Rows Based on Not Enough Repeated Data in a Large Data Set in R

I want to compute a 4-day rolling average over a large data set. The problem is that some individuals have fewer than 4 observations, so I get an error saying that k <= n is not TRUE.

Is there a way to remove any individual that does not have enough data in the data set?

Here is an example of how the data would look:

     Name  variable.1
1     Kim   64.703950
2     Kim  926.339849
3     Kim  128.662977
4     Kim  290.888594
5     Kim  869.418523
6     Bob  594.973849
7     Bob  408.159544
8     Bob  609.140928
9  Joseph  496.779712
10 Joseph  444.028668
11 Joseph -213.375635
12 Joseph  -76.728981
13 Joseph  265.642784
14   Hank  -91.646728
15   Hank  170.209746
16   Hank   97.889889
17   Hank   12.069074
18   Hank  402.361731
19   Earl  721.941796
20   Earl    4.823148
21   Earl  696.299627

Upvotes: 5

Answers (4)

Quinn Weber

Reputation: 927

You could create a second data.frame that is aggregated to the user level, with a row count for each user. Then join that data.frame onto the original one by user, and subset the result to the rows where count >= 4.
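
A minimal base R sketch of those steps, assuming the data frame is called df as in the other answers. The names counts, count, and result are made up here for illustration. The join step uses match() as a stand-in for merge(), because merge() may reorder rows and row order matters for a rolling average.

```r
# Example data copied from the question
df <- data.frame(
  Name = rep(c("Kim", "Bob", "Joseph", "Hank", "Earl"), times = c(5, 3, 5, 5, 3)),
  variable.1 = c(64.703950, 926.339849, 128.662977, 290.888594, 869.418523,
                 594.973849, 408.159544, 609.140928,
                 496.779712, 444.028668, -213.375635, -76.728981, 265.642784,
                 -91.646728, 170.209746, 97.889889, 12.069074, 402.361731,
                 721.941796, 4.823148, 696.299627)
)

# 1. Aggregate to one row per Name, holding the number of rows for that Name
counts <- aggregate(variable.1 ~ Name, data = df, FUN = length)
names(counts)[2] <- "count"

# 2. Join the counts back onto the original rows by Name (order-preserving lookup)
df$count <- counts$count[match(df$Name, counts$Name)]

# 3. Keep only rows whose Name has at least 4 observations, then drop the helper column
result <- subset(df, count >= 4, select = -count)
```

With the example data, result should contain only the Kim, Joseph, and Hank rows, in their original order.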

Upvotes: 0

Steven Beaupré

Reputation: 21641

Try:

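library(zoo)
library(dplyr)

df %>%
  group_by(Name) %>%
  # keep only names with at least 4 rows
  filter(n() >= 4) %>%
  # right-aligned 4-row rolling mean; fill = NA pads the first 3 rows of each group
  mutate(daymean = rollmean(variable.1, 4, align = "right", fill = NA))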

This keeps only the groups with at least 4 rows and computes a 4-day rolling average of variable.1 within each group:

#     Name variable.1  daymean
#1     Kim   64.70395       NA
#2     Kim  926.33985       NA
#3     Kim  128.66298       NA
#4     Kim  290.88859 352.6488
#5     Kim  869.41852 553.8275
#6  Joseph  496.77971       NA
#7  Joseph  444.02867       NA
#8  Joseph -213.37563       NA
#9  Joseph  -76.72898 162.6759
#10 Joseph  265.64278 104.8917
#11   Hank  -91.64673       NA
#12   Hank  170.20975       NA
#13   Hank   97.88989       NA
#14   Hank   12.06907  47.1305
#15   Hank  402.36173 170.6326
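
If installing zoo and dplyr is not an option, the same result can be approximated in base R. This is only a sketch under assumptions: it swaps rollmean for stats::filter with sides = 1 (a trailing moving average), and roll4, keep, and out are names invented here.

```r
# Example data copied from the question
df <- data.frame(
  Name = rep(c("Kim", "Bob", "Joseph", "Hank", "Earl"), times = c(5, 3, 5, 5, 3)),
  variable.1 = c(64.703950, 926.339849, 128.662977, 290.888594, 869.418523,
                 594.973849, 408.159544, 609.140928,
                 496.779712, 444.028668, -213.375635, -76.728981, 265.642784,
                 -91.646728, 170.209746, 97.889889, 12.069074, 402.361731,
                 721.941796, 4.823148, 696.299627)
)

# Trailing 4-value mean; the first 3 positions come back as NA
roll4 <- function(x) as.numeric(stats::filter(x, rep(1 / 4, 4), sides = 1))

# Drop names with fewer than 4 rows before computing the rolling mean
keep <- ave(seq_along(df$Name), df$Name, FUN = length) >= 4
out <- df[keep, ]

# Apply the rolling mean separately within each remaining name
out$daymean <- ave(out$variable.1, out$Name, FUN = roll4)
```

Filtering first matters here: each remaining group has at least 4 values, so the length-4 window always fits.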

Upvotes: 2

BrodieG

Reputation: 52687

Here are two options in base R. The first uses ave to produce a vector that holds, for each row, the size of the group that row belongs to (ave recycles its result to fill each group). Since a 4-day window needs at least 4 rows, the comparison is >= 4:

subset(DF, ave(seq_along(Name), Name, FUN=length) >= 4)

The second uses table to count the rows in each group, and %in% to keep only the rows that belong to groups with enough rows:

subset(DF, Name %in% names(table(Name)[table(Name) >= 4]))

Both produce:

     Name variable.1
1     Kim   64.70395
2     Kim  926.33985
3     Kim  128.66298
4     Kim  290.88859
5     Kim  869.41852
9  Joseph  496.77971
10 Joseph  444.02867
11 Joseph -213.37563
12 Joseph  -76.72898
13 Joseph  265.64278
14   Hank  -91.64673
15   Hank  170.20975
16   Hank   97.88989
17   Hank   12.06907
18   Hank  402.36173
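
To see what ave is doing in the first option, here is a small sketch on a made-up vector (the six names below are hypothetical, not the question's data):

```r
# Toy input: one group of 4 and one group of 2
Name <- c("Kim", "Kim", "Kim", "Kim", "Bob", "Bob")

# length() returns one number per group; ave() recycles it across that group's rows
group_size <- ave(seq_along(Name), Name, FUN = length)
group_size
# 4 4 4 4 2 2

# The logical vector used by subset(): TRUE for rows in groups of at least 4
keep <- group_size >= 4
Name[keep]
```

Only the four Kim entries survive, because Bob's group size of 2 is below the threshold.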

Upvotes: 0

davechilders

Reputation: 9143

If your data frame is df, you can remove all names that occur fewer than 4 times with dplyr:

library(dplyr)

df %>%
  group_by(Name) %>%
  filter(n() >= 4)

Upvotes: 4
