spore234

Reputation: 3640

select multiple observations by value

I have a data frame with multiple observations for each ID like this:

edit: updated data frame

df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
             V2=rnorm(9))

> df
  ID V1          V2
1  1  A  1.57707547
2  1  B -0.76022296
3  1  C -0.82693346
4  2  A  1.80888747
5  2  A -0.53173950
6  3  B -1.18705727
7  3  B  0.04325324
8  3  C -0.33361802
9  4  A -0.02358198

Now I want to select rows per ID in the following way: for each ID, keep the rows where V1 is 'A'; if an ID has no 'A' rows, keep its 'B' rows instead.

In my example I want to get this:

    ID V1          V2
  1  1  A  1.57707547
  2  2  A  1.80888747
  3  2  A -0.53173950
  4  3  B -1.18705727
  5  3  B  0.04325324
  6  4  A -0.02358198

I'd also like to see a dplyr solution if applicable.

Upvotes: 3

Views: 1168

Answers (2)

akrun

Reputation: 887048

Here is one option with dplyr. We group by 'ID', first filter to the rows where 'V1' is either 'A' or 'B', and then apply a second filter based on the number of distinct elements in 'V1' (n_distinct(V1)): if a group has more than one distinct value we keep only its 'A' rows (n_distinct(V1) > 1 & V1 == 'A'), and if it has a single distinct value we keep all of its rows (n_distinct(V1) == 1).

 library(dplyr)
 df %>% 
    group_by(ID) %>% 
    filter(V1 %in% c('A', 'B')) %>%
    filter(n_distinct(V1) > 1 & V1 == 'A' | n_distinct(V1) == 1)
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
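
A small usage note, not part of the original answer: the result of the pipeline above is still a grouped tibble. If downstream code expects an ungrouped data frame, ungroup() can be chained on at the end:

 df %>% 
    group_by(ID) %>% 
    filter(V1 %in% c('A', 'B')) %>%
    filter(n_distinct(V1) > 1 & V1 == 'A' | n_distinct(V1) == 1) %>%
    ungroup()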

A modified version with a single filter can also work. A row is kept if: the group has more than one distinct value in 'V1', none of those values is 'A' (all(V1 != 'A')), and the row itself is a 'B'; or the group has more than one distinct value and the row is an 'A'; or the group has a single distinct value and it is either 'A' or 'B'.

 df %>% 
   group_by(ID) %>% 
   filter(n_distinct(V1) > 1 & all(V1 != 'A') & V1 == 'B' |
            n_distinct(V1) > 1 & V1 == 'A' |
            n_distinct(V1) == 1 & V1 %in% c('A', 'B'))
 #   ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198

Or, a more compact version (inspired by @MichaelChirico's post): group by 'ID' and keep the rows where 'V1' is 'A', or where 'V1' is 'B' and the group contains no 'A'.

 df %>%
     group_by(ID) %>%
     filter(V1 == 'A' | V1 == 'B' & !any(V1 == 'A'))
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
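
As a side note not covered in the original answer, dplyr versions from 1.1.0 onward also support per-operation grouping through the .by argument, so the same compact filter can be written without group_by(); a sketch, assuming such a dplyr version is available:

 df %>%
     filter(V1 == 'A' | V1 == 'B' & !any(V1 == 'A'), .by = ID)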

Upvotes: 3

MichaelChirico

Reputation: 34703

In data.table, this can be readily done with:

library(data.table); setDT(df)
df[df[, .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))], by = ID]$V1]

The inner df call picks the row indices (.I) where either a) V1 is "A", or b) V1 is "B" and there is no "A" among the V1 values for that ID (that is, the check is done by ID); $V1 extracts these indices and passes them to the outer df.

(It may be confusing that we extract V1, since the original table also has a column called V1; the V1 we extract here is different, it is simply the automatic name data.table assigns to the unnamed index column. To see this, consider this alternative where we name the outcome variable:

df[df[, .(ind = .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))]), by = ID]$ind]

Here we name the index variable ind, so we extract ind instead of V1.)
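
As a quick check that is not part of the original answer, the inner grouped call can be run on its own; with the example data it should return the selected row numbers (1, 4, 5, 6, 7 and 9), one per kept row, roughly like this:

df[, .(ind = .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))]), by = ID]
#    ID ind
# 1:  1   1
# 2:  2   4
# 3:  2   5
# 4:  3   6
# 5:  3   7
# 6:  4   9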

Upvotes: 2
