Canovice
Canovice

Reputation: 10163

Group_by then filter with dplyr

First of all, please let me know if what I'm doing is poor use of dplyr because I'm not sure if I'm approaching this in the best way. I have the following dataframe:

mydf = data.frame(user = c(7,7,7,7,7,7,7,8,8,8,8,8,8),
                  col1 = c('0','0','1','1','0','3','NULL','3','3','0','1','0','0'),
                  col2 = runif(n=13),
                  col3 = letters[1:13], 
                  stringsAsFactors = FALSE)

> mydf
   user col1      col2 col3
1     7    0 0.7607907    a
2     7    0 0.1580448    b
3     7    1 0.8063540    c
4     7    1 0.7331512    d
5     7    0 0.2433631    e
6     7    3 0.2357065    f
7     7 NULL 0.4864172    g
8     8    3 0.6806089    h
9     8    3 0.2229874    i
10    8    0 0.6187911    j
11    8    1 0.7617177    k
12    8    0 0.5884821    l
13    8    0 0.4985750    m

The filtering I would like to do is a bit wordy, but I'll try - I would like to filter the dataframe by removing all rows where col1 == '0' if that row occurs AFTER the first row for that user where col1 == '1'. (bolded indicates i messed up the original question, and switched the 0 and 1).

For example, for user 7, the 3rd row has col1 == '1', so I'd like to filter all rows after row 3 where col1 == '0' (in this case, only row 5). Then, for user 8, the 11th row is the first row for this user where col1 == '1', so I'd then like to filter rows 12 and 13, since col1 == '0'.

My final output should be something like this:

> mydf
   user col1      col2 col3
1     7    0 0.7607907    a
2     7    0 0.1580448    b
3     7    1 0.8063540    c
4     7    1 0.7331512    d
6     7    3 0.2357065    f
7     7 NULL 0.4864172    g
8     8    3 0.6806089    h
9     8    3 0.2229874    i
10    8    0 0.6187911    j
11    8    1 0.7617177    k

I've tried the following, but it didn't work. I thought adding a rownums column, then grouping by user, and then filtering how I've described would work. My thoughts are that there is something wrong with my filter call:

mydf %>%
  mutate(rownums = 1:nrow(mydf)) %>%
  group_by(user) %>%
  filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))

# A tibble: 9 x 5
# Groups:   col0 [2]
   user  col1      col2  col3 rownums
  <dbl> <chr>     <dbl> <chr>   <int>
1     7     0 0.2088034     a       1
2     7     0 0.2081894     b       2
3     7     1 0.1825428     c       3
4     7     1 0.2143353     d       4
5     7     3 0.1979774     f       6
6     7  NULL 0.2990799     g       7
7     8     3 0.7808038     h       8
8     8     3 0.1694272     i       9
9     8     1 0.1526450     k      11

The difference between this output, and the correct output, is that this output incorrectly also filtered row 10 of the original dataframe.

Any help with this is appreciated!

EDIT - I'm particularly curious if group_by() %>% filter() is bad practice in R with regards to dplyr. 99% of my group_by() are followed by summarize(), which obviously makes more sense.

EDIT2 - I think I've got it!

mydf %>%
  group_by(col0) %>%
  mutate(rownums = 1:length(col0)) %>%
  filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))

Simply flipping the order of the mutate() and group_by() calls, and tweaking the mutate() call a bit, looks to have gotten it done. I'm open to hearing a better approach though.

Upvotes: 5

Views: 8337

Answers (4)

liborm
liborm

Reputation: 2724

There is a cumany function, which is useful for these sequential conditions, like this:

mydf %>%
  group_by(user) %>% 
  mutate(seen_one = cumany(col1 == "1")) %>%
  filter(!seen_one | col1 != "0")

That is mark all rows after "1" has been in the 'stream' with seen_one, and then keep the rows where one of the conditions is not met. (Semantics of filter require to invert the condition to 'get rid' of the rows, !(A & B) == !A | !B.)

Upvotes: 4

Aramis7d
Aramis7d

Reputation: 2496

alternately, building on the direction @liborm's answer provides:

mydf %>%
  group_by(user) %>%
  mutate(k = cumany(col1 == '0'), j = cumany(col1 == '1')) %>%
  filter(!(col1 == 0 & k == TRUE & j == TRUE)) %>%
  select(-k,-j)

returns:

    user  col1  col2  col3
   <dbl> <chr> <dbl> <chr>
 1     7     0     1     a
 2     7     0     1     b
 3     7     1     0     c
 4     7     1     0     d
 5     7     3     0     f
 6     7  NULL     1     g
 7     8     3     0     h
 8     8     3     1     i
 9     8     0     1     j
10     8     1     0     k

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388807

Could be solved by updating your attempt a little :

library(dplyr)
mydf %>%
   group_by(user) %>%
   filter(col1 != 0 | row_number() < which.max(col1 == 1))


#    user   col1        col2  col3
#   <dbl> <chr>       <dbl> <chr>
# 1     7     0 0.756522673     a
# 2     7     0 0.168314555     b
# 3     7     1 0.977254798     c
# 4     7     1 0.722721694     d
# 5     7     3 0.407849378     f
# 6     7  NULL 0.245335151     g
# 7     8     3 0.003423735     h
# 8     8     3 0.191716738     i
# 9     8     0 0.626846893     j
#10     8     1 0.546459621     k

Using filter we select all the rows where col1 is not equal to 0 or those rows where the current row is less than the index of first occurrence of 1 for that group.

Upvotes: 2

Sotos
Sotos

Reputation: 51582

Here is an idea via dplyr

library(dplyr)

df %>% 
 group_by(user) %>% 
 mutate(id1 = row_number(), new_col = max(which(col1 == 1)+1)) %>% 
 filter(!(col1 == 0 & id1 >= new_col))

which gives,

# A tibble: 10 x 6
# Groups:   user [2]
    user  col1       col2  col3   id1 new_col
   <dbl> <chr>      <dbl> <chr> <int>   <dbl>
 1     7     0 0.54742608     a     1       5
 2     7     0 0.89271859     b     2       5
 3     7     1 0.48999057     c     3       5
 4     7     1 0.17163211     d     4       5
 5     7     3 0.96146770     f     6       5
 6     7  NULL 0.31368382     g     7       5
 7     8     3 0.82051455     h     1       5
 8     8     3 0.30705440     i     2       5
 9     8     0 0.18545358     j     3       5
10     8     1 0.04834678     k     4       5

Upvotes: 1

Related Questions