Reputation: 7107
I am trying to filter my data such that, in some dates I have too many observations and I want to cut a few of them out. For example in the data below I have for 2005-03-31
164 observations which distorts my results somewhat. What I have done is compute some ranking method cosine_ntile
which goes from 1:5. For the observations which has "too many" observations I am only interested in the top ntile, which is why I grouped by cosine_ntile
and created more ntiles called cosine_ntile_ntile
.
Now I am only interested in ntile 1 of cosine_ntile_ntile
if the number of observations exceed 100 for example. What I tried is filter(total_obs > 100 & cosine_ntile_ntile == 1)
which is not giving the desired output.
I am trying to filter if total_obs
is greater than 100, then remove the observatios which have a score of 2, 3, 4 or 5 in the cosine_ntile_ntile
column.
Data looks like:
# A tibble: 200 x 4
filing_date_EoM total_obs cosine_ntile cosine_ntile_ntile
<date> <int> <dbl> <int>
1 2005-07-31 6 1 4
2 2005-09-30 15 4 2
3 2005-03-31 164 1 4
4 2005-12-31 14 2 3
5 2005-11-30 5 5 4
6 2005-03-31 164 4 3
7 2005-12-31 14 5 5
8 2005-12-31 14 2 4
9 2005-03-31 164 1 5
10 2005-02-28 17 3 3
Data:
structure(list(filing_date_EoM = structure(c(12995, 13056, 12873,
13148, 13117, 12873, 13148, 13148, 12873, 12842, 12964, 12934,
12873, 12964, 12842, 13117, 12873, 12873, 12903, 13148, 13117,
12873, 12873, 13056, 12873, 12934, 12873, 12842, 12873, 12842,
12814, 12873, 12995, 12873, 12873, 13056, 12873, 12903, 12873,
13148, 13056, 12964, 13026, 13056, 12903, 12903, 12873, 12903,
12873, 12873, 12873, 12873, 12842, 13148, 12842, 13026, 12934,
13117, 13026, 12873, 13026, 13056, 12873, 12903, 12873, 12873,
12873, 13117, 12873, 12873, 12873, 12964, 12873, 13148, 12873,
12903, 12873, 12873, 13056, 12873, 13087, 12873, 12873, 12873,
12903, 12873, 12873, 12873, 12873, 12873, 12873, 12873, 13056,
13026, 12873, 12873, 12903, 12903, 13117, 12873, 12873, 12873,
12873, 12903, 12964, 12873, 12873, 13056, 12873, 12873, 12995,
12873, 12995, 12873, 12873, 12873, 12873, 12814, 12873, 12873,
12995, 12873, 12873, 13148, 12814, 12873, 12873, 12903, 12873,
12873, 12842, 12842, 12873, 13148, 12873, 12873, 13148, 12873,
12842, 13056, 12964, 13056, 13148, 12873, 12873, 12873, 12903,
12964, 12873, 12873, 12873, 13056, 12964, 12873, 12995, 12873,
13148, 13148, 12934, 12873, 12873, 13026, 12873, 12873, 12873,
12842, 12873, 12873, 12873, 12873, 12873, 13148, 12903, 12873,
12873, 12873, 12873, 12873, 12873, 12873, 13056, 12873, 12814,
12873, 12964, 12842, 12842, 12873, 13056, 12842, 12873, 12873,
12873, 12873, 12873, 13117, 13026, 12873, 12903, 12873), class = "Date"),
total_obs = c(6L, 15L, 164L, 14L, 5L, 164L, 14L, 14L, 164L,
17L, 13L, 5L, 163L, 13L, 17L, 5L, 164L, 164L, 13L, 14L, 6L,
164L, 164L, 15L, 163L, 5L, 164L, 17L, 164L, 17L, 4L, 164L,
5L, 164L, 164L, 16L, 164L, 13L, 164L, 15L, 15L, 14L, 4L,
15L, 13L, 12L, 164L, 13L, 164L, 164L, 163L, 163L, 17L, 14L,
17L, 4L, 5L, 5L, 4L, 164L, 5L, 16L, 164L, 13L, 164L, 164L,
164L, 5L, 164L, 163L, 164L, 13L, 164L, 15L, 164L, 13L, 164L,
164L, 15L, 164L, 3L, 164L, 164L, 164L, 12L, 164L, 164L, 164L,
164L, 164L, 163L, 164L, 15L, 4L, 164L, 164L, 12L, 13L, 6L,
164L, 164L, 163L, 163L, 13L, 14L, 163L, 164L, 15L, 163L,
164L, 6L, 164L, 6L, 164L, 164L, 163L, 164L, 4L, 164L, 163L,
6L, 164L, 164L, 14L, 4L, 164L, 164L, 13L, 163L, 163L, 17L,
17L, 164L, 15L, 164L, 164L, 14L, 164L, 17L, 15L, 14L, 16L,
14L, 164L, 164L, 163L, 13L, 13L, 164L, 164L, 164L, 15L, 13L,
164L, 5L, 164L, 15L, 14L, 5L, 164L, 164L, 5L, 164L, 163L,
164L, 17L, 164L, 163L, 164L, 164L, 164L, 15L, 13L, 164L,
163L, 164L, 164L, 163L, 164L, 164L, 15L, 163L, 4L, 164L,
14L, 17L, 17L, 164L, 15L, 17L, 164L, 164L, 164L, 163L, 164L,
5L, 5L, 164L, 12L, 163L), cosine_ntile = c(1, 4, 1, 2, 5,
4, 5, 2, 1, 3, 2, 3, 5, 5, 5, 2, 2, 2, 4, 4, 1, 2, 1, 4,
5, 1, 1, 1, 1, 3, 1, 3, 2, 1, 3, 1, 4, 2, 1, 1, 4, 3, 4,
3, 4, 5, 2, 2, 2, 1, 5, 5, 5, 4, 3, 2, 2, 5, 2, 3, 1, 1,
1, 2, 2, 2, 3, 5, 1, 5, 4, 2, 3, 1, 2, 1, 3, 3, 3, 1, 5,
1, 4, 3, 5, 1, 1, 4, 4, 4, 5, 4, 2, 5, 2, 1, 5, 4, 1, 3,
1, 5, 5, 2, 1, 5, 4, 4, 5, 4, 1, 1, 3, 1, 2, 5, 2, 4, 1,
5, 1, 1, 4, 3, 5, 1, 2, 2, 5, 5, 2, 2, 4, 1, 2, 2, 3, 1,
2, 3, 3, 1, 2, 4, 1, 5, 1, 5, 2, 1, 3, 3, 4, 3, 4, 2, 1,
4, 4, 3, 4, 3, 2, 5, 3, 2, 3, 5, 2, 2, 4, 1, 2, 1, 5, 3,
2, 5, 1, 4, 4, 5, 3, 1, 3, 2, 2, 2, 2, 4, 2, 4, 2, 5, 1,
4, 1, 1, 5, 5), cosine_ntile_ntile = c(4L, 2L, 4L, 3L, 4L,
3L, 5L, 4L, 5L, 3L, 4L, 3L, 3L, 4L, 1L, 1L, 3L, 1L, 3L, 1L,
5L, 4L, 3L, 2L, 2L, 4L, 5L, 2L, 2L, 1L, 1L, 5L, 4L, 3L, 4L,
1L, 1L, 3L, 4L, 4L, 3L, 2L, 3L, 4L, 1L, 2L, 4L, 1L, 2L, 2L,
2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 1L, 4L, 4L, 5L, 2L,
1L, 4L, 1L, 1L, 2L, 4L, 3L, 4L, 2L, 3L, 2L, 2L, 5L, 3L, 1L,
2L, 2L, 4L, 1L, 5L, 1L, 4L, 2L, 1L, 3L, 2L, 3L, 3L, 4L, 1L,
5L, 1L, 2L, 3L, 4L, 4L, 3L, 1L, 2L, 2L, 1L, 1L, 5L, 1L, 2L,
3L, 5L, 3L, 2L, 1L, 5L, 3L, 2L, 2L, 2L, 2L, 2L, 5L, 1L, 3L,
3L, 5L, 4L, 4L, 5L, 2L, 5L, 4L, 1L, 4L, 1L, 1L, 2L, 5L, 2L,
3L, 3L, 5L, 3L, 2L, 1L, 1L, 2L, 4L, 2L, 5L, 2L, 3L, 4L, 1L,
4L, 1L, 2L, 4L, 1L, 3L, 1L, 5L, 3L, 2L, 1L, 5L, 1L, 4L, 3L,
2L, 3L, 1L, 5L, 1L, 2L, 1L, 2L, 3L, 1L, 5L, 3L, 1L, 2L, 3L,
1L, 3L, 4L, 4L, 4L, 1L, 2L, 1L, 1L, 1L, 5L, 5L, 2L, 1L, 5L
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-200L))
EDIT: Expected output:
filing_date_EoM total_obs cosine_ntile cosine_ntile_ntile
1 2005-07-31 6 1 4
2 2005-09-30 15 4 2
3 2005-03-31 164 1 4 ~Gets removed
4 2005-12-31 14 2 3
5 2005-11-30 5 5 4
6 2005-03-31 164 4 3 ~Gets removed
7 2005-12-31 14 5 5
8 2005-12-31 14 2 4
9 2005-03-31 164 1 5 ~Gets removed
10 2005-02-28 17 3 3
11 2005-06-30 13 2 4
12 2005-05-31 5 3 3
13 2005-03-31 163 5 3 ~Gets removed
14 2005-06-30 13 5 4
15 2005-02-28 17 5 1
16 2005-11-30 5 2 1
17 2005-03-31 164 2 3 ~Gets removed
18 2005-03-31 164 2 1 ~STAYS
19 2005-04-30 13 4 3
20 2005-12-31 14 4 1
21 2005-11-30 6 1 5
22 2005-03-31 164 2 4 ~Gets removed
23 2005-03-31 164 1 3 ~Gets removed
24 2005-09-30 15 4 2
25 2005-03-31 163 5 2 ~Gets removed
26 2005-05-31 5 1 4
27 2005-03-31 164 1 5 ~Gets removed
28 2005-02-28 17 1 2
29 2005-03-31 164 1 2 ~Gets removed
30 2005-02-28 17 3 1
31 2005-01-31 4 1 1
32 2005-03-31 164 3 5 ~Gets removed
33 2005-07-31 5 2 4
34 2005-03-31 164 1 3 ~Gets removed
35 2005-03-31 164 3 4 ~Gets removed
36 2005-09-30 16 1 1
37 2005-03-31 164 4 1 ~STAYS
38 2005-04-30 13 2 3
39 2005-03-31 164 1 4 ~Gets removed
40 2005-12-31 15 1 4
41 2005-09-30 15 4 3
42 2005-06-30 14 3 2
43 2005-08-31 4 4 3
44 2005-09-30 15 3 4
45 2005-04-30 13 4 1
Observations 18
stays because total_obs
> 100 and it is in cosine_ntile_ntile
= 1. The other observations which fail on the first condition - i.e. being < 100 remain also (or not action taken).
Upvotes: 1
Views: 56
Reputation: 886938
We can create the filter
and negate (!
)
library(dplyr)
df2 <- df1 %>%
filter(!(total_obs > 100 & cosine_ntile_ntile != 1))
-checking the output
df2 %>%
filter(total_obs > 100, cosine_ntile_ntile != 1)
# A tibble: 0 x 4
# … with 4 variables: filing_date_EoM <date>, total_obs <int>, cosine_ntile <dbl>, cosine_ntile_ntile <int>
Or using subset
from base R
subset(df1, !(total_obs > 100 & cosine_ntile_ntile != 1))
Upvotes: 1