Reputation: 3640
I have a data frame with multiple observations for each ID like this:
edit: updated data frame
df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4), V1=c("A","B","C","A","A","B","B","C","A"),
V2=rnorm(9))
> df
ID V1 V2
1 1 A 1.57707547
2 1 B -0.76022296
3 1 C -0.82693346
4 2 A 1.80888747
5 2 A -0.53173950
6 3 B -1.18705727
7 3 B 0.04325324
8 3 C -0.33361802
9 4 A -0.02358198
Now I want to select all rows per ID in the following way:
In my example I want to have this:
ID V1 V2
1 1 A 1.57707547
2 2 A 1.80888747
3 2 A -0.53173950
4 3 B -1.18705727
5 3 B 0.04325324
6 4 A -0.02358198
I'd also like to see a dplyr
solution if applicable.
Upvotes: 3
Views: 1168
Reputation: 887048
Here is one option with dplyr
. We group by 'ID' column, filter
the rows that have either 'A' or 'B' rows, do another filter
to check the number of unique elements in 'V1' (n_distinct(V1)
). If it is greater than 1, and the element is 'A' we select it (n_distinct(V1)>1 & V1=='A'
) or we select all the length of unique elements as 1.
library(dplyr)
df %>%
group_by(ID) %>%
filter(V1 %in% c('A', 'B'))%>%
filter(n_distinct(V1)>1 & V1=='A'|n_distinct(V1)==1)
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
Maybe we can use a modified version using single filter
. We check if the number of unique elements in 'V1' is greater than 1 and if none of the elements are 'A' (all(V1!='A')
) and if the element is 'B', we select that row, or if the number of distinct elements are greater than 1 and there is a 'A' element in it, select that row or if the number of unique element is 1 and the element is either 'A' or 'B', select the row.
df %>%
group_by(ID) %>%
filter(n_distinct(V1)>1 & all(V1 !='A') & V1=='B'|n_distinct(V1)>1 &
V1=='A' |n_distinct(V1)==1 & V1 %in% c('A', 'B') )
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
Or a bit more compact one would be (Inspired from @MichaelChirico's post). We group by 'ID', and filter
either V1 with 'A' rows or 'B' and without any 'A' rows.
df %>%
group_by(ID) %>%
filter(V1=='A'|V1=='B'&!any(V1=='A'))
# ID V1 V2
#1 1 A 1.57707547
#2 2 A 1.80888747
#3 2 A -0.53173950
#4 3 B -1.18705727
#5 3 B 0.04325324
#6 4 A -0.02358198
Upvotes: 3
Reputation: 34703
In data.table
, this can be readily done with:
library(data.table); setDT(df)
df[df[,.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))],by=ID]$V1]
The inner df
call picks the indices (.I
) where either a) V1
is A
or b) V1
is B
and there are no other A
element among the V1
for that ID
(that is, this is done by
ID
); $V1
extracts these indices and passes them back to the outer df
.
(It may be confusing that we extract V1
because there's a column in the original table called V1
, but the V1
we extract is different; to see this, consider this alternative where we name the outcome variable:
df[df[,.(ind=.I[V1=="A"|(V1=="B"&!"A"%in%unique(V1))]),by=ID]$ind]
Here, we name the index variable ind
, so we have to extract ind
instead of V1
)
Upvotes: 2