Reputation: 1587
I have the following dataframe
mydf <- data.frame(c(1,  1,  3,  4,  5,  6),
                   c(1,  1,  3, 10, 20, 30),
                   c(1, 10, 20, 20, 15, 10))
colnames(mydf) <- c("ID1", "ID2", "ID3")
Let's count how many instances of each mydf$ID1 and mydf$ID2 combination we have, and keep those that appear exactly twice:
tmp <- plyr::ddply(mydf, c("ID1", "ID2"), nrow)
tmp <- tmp[tmp$V1 == 2, ]
This gives the following output (there are two identical instances):
ID1 ID2 V1
1 1 1 2
Now I want to extract the remaining fields, i.e. ID3, of these two instances:
asd <- mydf[(mydf$ID1 %in% tmp$ID1) & (mydf$ID2 %in% tmp$ID2), ]
This gives me the following
ID1 ID2 ID3
1 1 1 1
2 1 1 10
This is exactly what I wanted, so all good.
However, my real data set doesn't contain only 3 vectors like mydf, but 300, and each vector is not just 6 elements long but 50,000+.
When I run the above method on my data, I get groups of entries that occur not only 2 times, but also 1, 3, 4 and so on, which means something is wrong.
Is it because the above method cannot be scaled up to large data, or is it because the method I outlined above is inherently wrong?
Unfortunately, due to the large size of my data, I cannot post it here.
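To illustrate the kind of mismatch I'm seeing, here is a minimal made-up sketch (not my real data) where the same %in% approach keeps a row whose ID1/ID2 pair occurs only once:
toy <- data.frame(ID1 = c(1, 1, 3, 3, 1),
                  ID2 = c(10, 10, 20, 20, 20))
cnt <- plyr::ddply(toy, c("ID1", "ID2"), nrow)
cnt <- cnt[cnt$V1 == 2, ]   # pairs (1,10) and (3,20) each occur twice
# The independent column-wise tests also keep row 5, the pair (1,20),
# even though that pair occurs only once:
toy[(toy$ID1 %in% cnt$ID1) & (toy$ID2 %in% cnt$ID2), ]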
UPDATE: Using a data frame with 3 vectors was just chosen as an example. In reality my data has many vectors, but I still want to group rows based on duplicates in two of them. An example:
library(dplyr)
mydf <- data.frame(c(1,  1,  3,  4,  5,  6,  0),
                   c(22, 22,  3, 10, 20, 30,  0),
                   c(2,  10, 20, 20, 15, 10,  0),
                   c(3,  10, 20, 20, 15, 10,  0),
                   c(4,  10, 20, 20, 15, 10, 15),
                   c(5,  99, 98, 98, 97, 99, 97))
colnames(mydf) <- c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6")
(mydf %>% group_by(ID5, ID6) %>% filter(n()==2))
As desired, this produces
    ID1   ID2   ID3   ID4   ID5   ID6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    22    10    10    10    99
2     3     3    20    20    20    98
3     4    10    20    20    20    98
4     5    20    15    15    15    97
5     6    30    10    10    10    99
6     0     0     0     0    15    97
Upvotes: 1
Views: 76
Reputation: 38520
A base R solution is to use ave to calculate the group counts and then subset to the rows whose ID1/ID2 combination appears exactly twice.
mydf[ave(mydf$ID3, mydf$ID1, mydf$ID2, FUN = length) == 2, ]
ID1 ID2 ID3
1 1 1 1
2 1 1 10
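For reference, ave on its own just returns the size of each row's (ID1, ID2) group, which is what the comparison against 2 relies on (a small illustration with the original 3-column mydf):
# Per-row group sizes for the (ID1, ID2) combinations:
ave(mydf$ID3, mydf$ID1, mydf$ID2, FUN = length)
# [1] 2 2 1 1 1 1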
To extend the grouping to many variables, you can replace the explicit listing of ID1 and ID2 with the output of Reduce and interaction, like this:
mydf[ave(mydf$ID3, Reduce(interaction, mydf[-3]), FUN = length) == 2, ]
ID1 ID2 ID3
1 1 1 1
2 1 1 10
Here, Reduce(interaction, mydf[-3]) constructs a factor variable that is the interaction of ID1 and ID2. This will happily take as many vectors as you provide.
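For instance, with the updated 6-column mydf from the question's UPDATE, and assuming the duplicates should be detected on ID5 and ID6 only, a sketch would be:
# Build the grouping factor from just the two columns of interest and
# count rows per group with ave(); seq_len() serves only as a dummy
# numeric vector to count over.
grp <- Reduce(interaction, mydf[c("ID5", "ID6")])
mydf[ave(seq_len(nrow(mydf)), grp, FUN = length) == 2, ]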
Upvotes: 2
Reputation: 887981
We can do this in an easier way with dplyr:
library(dplyr)
mydf %>%
  group_by(ID1, ID2) %>%
  filter(n() == 2)
If there are more than 2 columns as grouping variables, we can use group_by_:
mydf %>%
  group_by_(.dots = names(mydf)[1:2]) %>%
  filter(n() == 2)
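Note that group_by_ has since been deprecated; on a recent dplyr (>= 1.0.0, an assumption about your setup) the same programmatic grouping can be written with across and all_of, roughly:
# Sketch with the non-deprecated tidyselect interface:
mydf %>%
  group_by(across(all_of(names(mydf)[1:2]))) %>%
  filter(n() == 2)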
Upvotes: 3