BillyJean

Reputation: 1587

Subsetting list-elements based on conditional

I have the following data frame:

mydf<- data.frame(c(1,1,3,4,5,6),
                  c(1,1,3,10,20,30),
                  c(1,10,20,20,15,10))
colnames(mydf)<-c( "ID1","ID2", "ID3")

Let's count how many times each combination of mydf$ID1 and mydf$ID2 occurs, and keep the combinations that appear exactly twice:

tmp<-plyr::ddply(mydf,c("ID1","ID2"),nrow)
tmp<-tmp[tmp$V1==2,]

This has the following output (there are two identical instances):

  ID1 ID2 V1
1   1   1  2

Now I want to extract the remaining fields (here just ID3) for these two instances:

asd<-mydf[ (mydf$ID1 %in% tmp$ID1) & (mydf$ID1 %in% tmp$ID2), ]

This gives me the following

  ID1 ID2 ID3
1   1   1   1
2   1   1  10

This is exactly what I wanted, so all good.


However, my real data set doesn't contain just 3 columns like mydf, but 300, and each column is not 6 elements long but more than 50,000.

When I run the above method on my real data, I get entries that occur not only 2 times, but also 1, 3, 4 times and so on. So something is wrong.

Is it because the above method cannot be scaled up to large data? Or is it because the method I outlined above is inherently wrong?

Unfortunately, due to the large size of my data, I cannot post it here.
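
To illustrate what I suspect might be happening (a made-up toy case, nothing to do with my real data): the %in% checks only ask whether a value appears somewhere in the corresponding column of tmp, not whether the pair of values appears together in one row of tmp.

# Hypothetical: suppose tmp ended up holding the duplicated pairs (1, 5) and (2, 1)
tmp2 <- data.frame(ID1 = c(1, 2), ID2 = c(5, 1))
row1 <- data.frame(ID1 = 1, ID2 = 1, ID3 = 99)

# (1, 1) is not one of the duplicated pairs, yet the row passes the filter,
# because 1 appears somewhere in tmp2$ID1 and also somewhere in tmp2$ID2
(row1$ID1 %in% tmp2$ID1) & (row1$ID2 %in% tmp2$ID2)
[1] TRUE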


UPDATE: A data frame with 3 columns was just chosen as an example. In reality my data has many columns, but I still want to group based on duplicates in two of them. An example:

library(dplyr)


mydf<- data.frame(c(1,  1, 3, 4, 5, 6,0),
                  c(22,22, 3,10,20,30,0),
                  c(2, 10,20,20,15,10,0),
                  c(3, 10,20,20,15,10,0),
                  c(4, 10,20,20,15,10,15),
                  c(5, 99,98,98,97,99,97))
colnames(mydf)<-c( "ID1","ID2", "ID3", "ID4", "ID5", "ID6")

(mydf %>% group_by(ID5, ID6) %>% filter(n()==2))

As desired, this produces

    ID1   ID2   ID3   ID4   ID5   ID6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    22    10    10    10    99
2     3     3    20    20    20    98
3     4    10    20    20    20    98
4     5    20    15    15    15    97
5     6    30    10    10    10    99
6     0     0     0     0    15    97

Upvotes: 1

Views: 76

Answers (2)

lmo

Reputation: 38520

A base R solution is to use ave to calculate the count within each (ID1, ID2) group and then subset to the rows whose combination appears exactly twice.

mydf[ave(mydf$ID3, mydf$ID1, mydf$ID2, FUN=length) == 2, ]
  ID1 ID2 ID3
1   1   1   1
2   1   1  10

To extend the grouping to many variables, you can replace the explicit listing of ID1 and ID2 with the output of Reduce and interaction like this:

mydf[ave(mydf$ID3, Reduce(interaction, mydf[-3]), FUN=length) == 2, ]
  ID1 ID2 ID3
1   1   1   1
2   1   1  10

Here, Reduce(interaction, mydf[-3]) constructs a factor variable that is the interaction of ID1 and ID2. This will happily take as many vectors as you provide.
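
For instance, with the 6-column mydf from the question's update, grouping on ID5 and ID6 only might look something like this (grp is just a helper name, and ID1 serves as a dummy vector for FUN = length to count over):

# interaction of the two grouping columns, then count rows per group
grp <- Reduce(interaction, mydf[c("ID5", "ID6")])
mydf[ave(mydf$ID1, grp, FUN = length) == 2, ]

This should keep the same six rows as the dplyr call in the update.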

Upvotes: 2

akrun

Reputation: 887981

We can do this in an easier way with dplyr:

library(dplyr)
mydf %>% 
    group_by(ID1, ID2) %>% 
    filter(n()==2)

If there are more than 2 columns as grouping variables, we can use group_by_ with .dots:

mydf %>%
    group_by_(.dots = names(mydf)[1:2]) %>%
    filter(n()==2)
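
If you are on a newer dplyr (1.0 or later), group_by_ is deprecated, and the same programmatic grouping can be written with across() and all_of():

# non-deprecated equivalent of the group_by_(.dots = ...) call above
mydf %>%
    group_by(across(all_of(names(mydf)[1:2]))) %>%
    filter(n() == 2)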

Upvotes: 3
