spore234

Reputation: 3640

select multiple observations by value

I have a data frame with multiple observations for each ID like this:

edit: updated data frame

df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
             V2=rnorm(9))

> df
  ID V1          V2
1  1  A  1.57707547
2  1  B -0.76022296
3  1  C -0.82693346
4  2  A  1.80888747
5  2  A -0.53173950
6  3  B -1.18705727
7  3  B  0.04325324
8  3  C -0.33361802
9  4  A -0.02358198

Now I want to select rows per ID in the following way: for each ID, keep the rows where V1 is 'A'; if an ID has no 'A' rows, keep its 'B' rows instead.

In my example I want to get this:

    ID V1          V2
  1  1  A  1.57707547
  2  2  A  1.80888747
  3  2  A -0.53173950
  4  3  B -1.18705727
  5  3  B  0.04325324
  6  4  A -0.02358198

I'd also like to see a dplyr solution if applicable.

Upvotes: 3

Views: 1168

Answers (2)

akrun

Reputation: 887048

Here is one option with dplyr. We group by 'ID', first filter to the rows where 'V1' is either 'A' or 'B', and then apply a second filter based on the number of distinct elements in 'V1' (n_distinct(V1)): if a group has more than one distinct value we keep only its 'A' rows (n_distinct(V1) > 1 & V1 == 'A'), and if it has a single distinct value we keep all of its rows (n_distinct(V1) == 1).

 library(dplyr)
 df %>% 
    group_by(ID) %>% 
    filter(V1 %in% c('A', 'B')) %>%
    filter(n_distinct(V1) > 1 & V1 == 'A' | n_distinct(V1) == 1)
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
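
A small usage note, not part of the original answer: the result of the pipeline above is still a grouped tibble. If downstream code expects an ungrouped data frame, ungroup() can be chained on at the end:

 df %>% 
    group_by(ID) %>% 
    filter(V1 %in% c('A', 'B')) %>%
    filter(n_distinct(V1) > 1 & V1 == 'A' | n_distinct(V1) == 1) %>%
    ungroup()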

A modified version with a single filter can also work. A row is kept if: the group has more than one distinct value in 'V1', none of those values is 'A' (all(V1 != 'A')), and the row itself is a 'B'; or the group has more than one distinct value and the row is an 'A'; or the group has a single distinct value and it is either 'A' or 'B'.

 df %>% 
   group_by(ID) %>% 
   filter(n_distinct(V1) > 1 & all(V1 != 'A') & V1 == 'B' |
            n_distinct(V1) > 1 & V1 == 'A' |
            n_distinct(V1) == 1 & V1 %in% c('A', 'B'))
 #   ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198

Or, a more compact version (inspired by @MichaelChirico's post): group by 'ID' and keep the rows where 'V1' is 'A', or where 'V1' is 'B' and the group contains no 'A'.

 df %>%
     group_by(ID) %>%
     filter(V1 == 'A' | V1 == 'B' & !any(V1 == 'A'))
 #  ID V1          V2
 #1  1  A  1.57707547
 #2  2  A  1.80888747
 #3  2  A -0.53173950
 #4  3  B -1.18705727
 #5  3  B  0.04325324
 #6  4  A -0.02358198
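
As a side note not covered in the original answer, dplyr versions from 1.1.0 onward also support per-operation grouping through the .by argument, so the same compact filter can be written without group_by(); a sketch, assuming such a dplyr version is available:

 df %>%
     filter(V1 == 'A' | V1 == 'B' & !any(V1 == 'A'), .by = ID)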

Upvotes: 3

MichaelChirico

Reputation: 34703

In data.table, this can be readily done with:

library(data.table); setDT(df)
df[df[, .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))], by = ID]$V1]

The inner df call picks the row indices (.I) where either a) V1 is "A", or b) V1 is "B" and there is no "A" among the V1 values for that ID (that is, the check is done by ID); $V1 extracts these indices and passes them to the outer df.

(It may be confusing that we extract V1, since the original table also has a column called V1; the V1 we extract here is different, it is simply the automatic name data.table assigns to the unnamed index column. To see this, consider this alternative where we name the outcome variable:

df[df[, .(ind = .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))]), by = ID]$ind]

Here we name the index variable ind, so we extract ind instead of V1.)
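
As a quick check that is not part of the original answer, the inner grouped call can be run on its own; with the example data it should return the selected row numbers (1, 4, 5, 6, 7 and 9), one per kept row, roughly like this:

df[, .(ind = .I[V1 == "A" | (V1 == "B" & !"A" %in% unique(V1))]), by = ID]
#    ID ind
# 1:  1   1
# 2:  2   4
# 3:  2   5
# 4:  3   6
# 5:  3   7
# 6:  4   9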

Upvotes: 2
