user113156
user113156

Reputation: 7107

filtering with an if statement and conditions: dplyr pipe

I am trying to filter my data such that, in some dates I have too many observations and I want to cut a few of them out. For example in the data below I have for 2005-03-31 164 observations which distorts my results somewhat. What I have done is compute some ranking method cosine_ntile which goes from 1:5. For the observations which has "too many" observations I am only interested in the top ntile, which is why I grouped by cosine_ntile and created more ntiles called cosine_ntile_ntile.

Now I am only interested in ntile 1 of cosine_ntile_ntile if the number of observations exceed 100 for example. What I tried is filter(total_obs > 100 & cosine_ntile_ntile == 1) which is not giving the desired output.

I am trying to filter if total_obs is greater than 100, then remove the observatios which have a score of 2, 3, 4 or 5 in the cosine_ntile_ntile column.

Data looks like:

# A tibble: 200 x 4
   filing_date_EoM total_obs cosine_ntile cosine_ntile_ntile
   <date>              <int>        <dbl>              <int>
 1 2005-07-31              6            1                  4
 2 2005-09-30             15            4                  2
 3 2005-03-31            164            1                  4
 4 2005-12-31             14            2                  3
 5 2005-11-30              5            5                  4
 6 2005-03-31            164            4                  3
 7 2005-12-31             14            5                  5
 8 2005-12-31             14            2                  4
 9 2005-03-31            164            1                  5
10 2005-02-28             17            3                  3

Data:

structure(list(filing_date_EoM = structure(c(12995, 13056, 12873, 
13148, 13117, 12873, 13148, 13148, 12873, 12842, 12964, 12934, 
12873, 12964, 12842, 13117, 12873, 12873, 12903, 13148, 13117, 
12873, 12873, 13056, 12873, 12934, 12873, 12842, 12873, 12842, 
12814, 12873, 12995, 12873, 12873, 13056, 12873, 12903, 12873, 
13148, 13056, 12964, 13026, 13056, 12903, 12903, 12873, 12903, 
12873, 12873, 12873, 12873, 12842, 13148, 12842, 13026, 12934, 
13117, 13026, 12873, 13026, 13056, 12873, 12903, 12873, 12873, 
12873, 13117, 12873, 12873, 12873, 12964, 12873, 13148, 12873, 
12903, 12873, 12873, 13056, 12873, 13087, 12873, 12873, 12873, 
12903, 12873, 12873, 12873, 12873, 12873, 12873, 12873, 13056, 
13026, 12873, 12873, 12903, 12903, 13117, 12873, 12873, 12873, 
12873, 12903, 12964, 12873, 12873, 13056, 12873, 12873, 12995, 
12873, 12995, 12873, 12873, 12873, 12873, 12814, 12873, 12873, 
12995, 12873, 12873, 13148, 12814, 12873, 12873, 12903, 12873, 
12873, 12842, 12842, 12873, 13148, 12873, 12873, 13148, 12873, 
12842, 13056, 12964, 13056, 13148, 12873, 12873, 12873, 12903, 
12964, 12873, 12873, 12873, 13056, 12964, 12873, 12995, 12873, 
13148, 13148, 12934, 12873, 12873, 13026, 12873, 12873, 12873, 
12842, 12873, 12873, 12873, 12873, 12873, 13148, 12903, 12873, 
12873, 12873, 12873, 12873, 12873, 12873, 13056, 12873, 12814, 
12873, 12964, 12842, 12842, 12873, 13056, 12842, 12873, 12873, 
12873, 12873, 12873, 13117, 13026, 12873, 12903, 12873), class = "Date"), 
    total_obs = c(6L, 15L, 164L, 14L, 5L, 164L, 14L, 14L, 164L, 
    17L, 13L, 5L, 163L, 13L, 17L, 5L, 164L, 164L, 13L, 14L, 6L, 
    164L, 164L, 15L, 163L, 5L, 164L, 17L, 164L, 17L, 4L, 164L, 
    5L, 164L, 164L, 16L, 164L, 13L, 164L, 15L, 15L, 14L, 4L, 
    15L, 13L, 12L, 164L, 13L, 164L, 164L, 163L, 163L, 17L, 14L, 
    17L, 4L, 5L, 5L, 4L, 164L, 5L, 16L, 164L, 13L, 164L, 164L, 
    164L, 5L, 164L, 163L, 164L, 13L, 164L, 15L, 164L, 13L, 164L, 
    164L, 15L, 164L, 3L, 164L, 164L, 164L, 12L, 164L, 164L, 164L, 
    164L, 164L, 163L, 164L, 15L, 4L, 164L, 164L, 12L, 13L, 6L, 
    164L, 164L, 163L, 163L, 13L, 14L, 163L, 164L, 15L, 163L, 
    164L, 6L, 164L, 6L, 164L, 164L, 163L, 164L, 4L, 164L, 163L, 
    6L, 164L, 164L, 14L, 4L, 164L, 164L, 13L, 163L, 163L, 17L, 
    17L, 164L, 15L, 164L, 164L, 14L, 164L, 17L, 15L, 14L, 16L, 
    14L, 164L, 164L, 163L, 13L, 13L, 164L, 164L, 164L, 15L, 13L, 
    164L, 5L, 164L, 15L, 14L, 5L, 164L, 164L, 5L, 164L, 163L, 
    164L, 17L, 164L, 163L, 164L, 164L, 164L, 15L, 13L, 164L, 
    163L, 164L, 164L, 163L, 164L, 164L, 15L, 163L, 4L, 164L, 
    14L, 17L, 17L, 164L, 15L, 17L, 164L, 164L, 164L, 163L, 164L, 
    5L, 5L, 164L, 12L, 163L), cosine_ntile = c(1, 4, 1, 2, 5, 
    4, 5, 2, 1, 3, 2, 3, 5, 5, 5, 2, 2, 2, 4, 4, 1, 2, 1, 4, 
    5, 1, 1, 1, 1, 3, 1, 3, 2, 1, 3, 1, 4, 2, 1, 1, 4, 3, 4, 
    3, 4, 5, 2, 2, 2, 1, 5, 5, 5, 4, 3, 2, 2, 5, 2, 3, 1, 1, 
    1, 2, 2, 2, 3, 5, 1, 5, 4, 2, 3, 1, 2, 1, 3, 3, 3, 1, 5, 
    1, 4, 3, 5, 1, 1, 4, 4, 4, 5, 4, 2, 5, 2, 1, 5, 4, 1, 3, 
    1, 5, 5, 2, 1, 5, 4, 4, 5, 4, 1, 1, 3, 1, 2, 5, 2, 4, 1, 
    5, 1, 1, 4, 3, 5, 1, 2, 2, 5, 5, 2, 2, 4, 1, 2, 2, 3, 1, 
    2, 3, 3, 1, 2, 4, 1, 5, 1, 5, 2, 1, 3, 3, 4, 3, 4, 2, 1, 
    4, 4, 3, 4, 3, 2, 5, 3, 2, 3, 5, 2, 2, 4, 1, 2, 1, 5, 3, 
    2, 5, 1, 4, 4, 5, 3, 1, 3, 2, 2, 2, 2, 4, 2, 4, 2, 5, 1, 
    4, 1, 1, 5, 5), cosine_ntile_ntile = c(4L, 2L, 4L, 3L, 4L, 
    3L, 5L, 4L, 5L, 3L, 4L, 3L, 3L, 4L, 1L, 1L, 3L, 1L, 3L, 1L, 
    5L, 4L, 3L, 2L, 2L, 4L, 5L, 2L, 2L, 1L, 1L, 5L, 4L, 3L, 4L, 
    1L, 1L, 3L, 4L, 4L, 3L, 2L, 3L, 4L, 1L, 2L, 4L, 1L, 2L, 2L, 
    2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 1L, 4L, 4L, 5L, 2L, 
    1L, 4L, 1L, 1L, 2L, 4L, 3L, 4L, 2L, 3L, 2L, 2L, 5L, 3L, 1L, 
    2L, 2L, 4L, 1L, 5L, 1L, 4L, 2L, 1L, 3L, 2L, 3L, 3L, 4L, 1L, 
    5L, 1L, 2L, 3L, 4L, 4L, 3L, 1L, 2L, 2L, 1L, 1L, 5L, 1L, 2L, 
    3L, 5L, 3L, 2L, 1L, 5L, 3L, 2L, 2L, 2L, 2L, 2L, 5L, 1L, 3L, 
    3L, 5L, 4L, 4L, 5L, 2L, 5L, 4L, 1L, 4L, 1L, 1L, 2L, 5L, 2L, 
    3L, 3L, 5L, 3L, 2L, 1L, 1L, 2L, 4L, 2L, 5L, 2L, 3L, 4L, 1L, 
    4L, 1L, 2L, 4L, 1L, 3L, 1L, 5L, 3L, 2L, 1L, 5L, 1L, 4L, 3L, 
    2L, 3L, 1L, 5L, 1L, 2L, 1L, 2L, 3L, 1L, 5L, 3L, 1L, 2L, 3L, 
    1L, 3L, 4L, 4L, 4L, 1L, 2L, 1L, 1L, 1L, 5L, 5L, 2L, 1L, 5L
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-200L))

EDIT: Expected output:

    filing_date_EoM total_obs cosine_ntile cosine_ntile_ntile
1        2005-07-31         6            1                  4
2        2005-09-30        15            4                  2
3        2005-03-31       164            1                  4  ~Gets removed
4        2005-12-31        14            2                  3
5        2005-11-30         5            5                  4
6        2005-03-31       164            4                  3 ~Gets removed
7        2005-12-31        14            5                  5
8        2005-12-31        14            2                  4
9        2005-03-31       164            1                  5 ~Gets removed
10       2005-02-28        17            3                  3
11       2005-06-30        13            2                  4
12       2005-05-31         5            3                  3
13       2005-03-31       163            5                  3 ~Gets removed
14       2005-06-30        13            5                  4
15       2005-02-28        17            5                  1
16       2005-11-30         5            2                  1
17       2005-03-31       164            2                  3 ~Gets removed
18       2005-03-31       164            2                  1 ~STAYS
19       2005-04-30        13            4                  3
20       2005-12-31        14            4                  1
21       2005-11-30         6            1                  5
22       2005-03-31       164            2                  4 ~Gets removed
23       2005-03-31       164            1                  3 ~Gets removed
24       2005-09-30        15            4                  2
25       2005-03-31       163            5                  2 ~Gets removed
26       2005-05-31         5            1                  4
27       2005-03-31       164            1                  5 ~Gets removed
28       2005-02-28        17            1                  2
29       2005-03-31       164            1                  2 ~Gets removed
30       2005-02-28        17            3                  1
31       2005-01-31         4            1                  1
32       2005-03-31       164            3                  5 ~Gets removed
33       2005-07-31         5            2                  4
34       2005-03-31       164            1                  3 ~Gets removed
35       2005-03-31       164            3                  4 ~Gets removed
36       2005-09-30        16            1                  1
37       2005-03-31       164            4                  1 ~STAYS
38       2005-04-30        13            2                  3
39       2005-03-31       164            1                  4 ~Gets removed
40       2005-12-31        15            1                  4
41       2005-09-30        15            4                  3
42       2005-06-30        14            3                  2
43       2005-08-31         4            4                  3
44       2005-09-30        15            3                  4
45       2005-04-30        13            4                  1

Observations 18 stays because total_obs > 100 and it is in cosine_ntile_ntile = 1. The other observations which fail on the first condition - i.e. being < 100 remain also (or not action taken).

Upvotes: 1

Views: 56

Answers (1)

akrun
akrun

Reputation: 886938

We can create the filter and negate (!)

library(dplyr)
df2 <-  df1 %>%
    filter(!(total_obs > 100 & cosine_ntile_ntile != 1))

-checking the output

df2 %>%
   filter(total_obs > 100, cosine_ntile_ntile != 1)
# A tibble: 0 x 4
# … with 4 variables: filing_date_EoM <date>, total_obs <int>, cosine_ntile <dbl>, cosine_ntile_ntile <int>

Or using subset from base R

subset(df1, !(total_obs > 100 & cosine_ntile_ntile != 1))

Upvotes: 1

Related Questions