Reputation: 91
I cant seem to find an answer through search to this on SO. I'm trying to select a subset of a data.frame
based on four conditions (lon1, lon2, lat1 and lat2). I have a huge dissimilarity matrix that has been vectorized and the sites (lon1, lon2, lat1 and lat2) cbind
to it. Here is an example data frame:
out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10),
lat1 = sample(1:10), lat2 = sample(1:10),
dissimilarity = sample(seq(0,1,.1),10))
> out1
lon1 lon2 lat1 lat2 dissimilarity
1 2 6 4 4 0.6
2 4 2 1 3 1.0
3 10 9 2 6 0.0
4 3 1 10 8 0.5
5 9 5 9 1 0.8
6 5 7 5 9 0.9
7 1 8 6 7 0.2
8 8 3 8 5 0.7
9 7 4 3 10 0.3
10 6 10 7 2 0.1
out2 <- out1[c(2,5,6,8),]
lon1 lon2 lat1 lat2 dissimilarity
1 4 2 1 3 1.0
2 9 5 9 1 0.8
3 5 7 5 9 0.9
4 8 3 8 5 0.7
I tried using %in%
function a few times in this manner:
test <- out1[(out1$lon1 %in% out2$lon1) & (out1$lon2 %in% out2$lon2) &
(out1$lat1 %in% out2$lat1) & (out1$lat2 %in% out2$lat2), ]
This seems to work for the basic example I provide here. But, when I apply it to my huge data frame (with many lat
and lons
repeated) I get back a larger subset than the unqiue combinations I require. I assume because the match function in %in%
can only match a vector. So it's matching condition1 &
condition2 &
condition3 &
condition4 And thus is returning a results that gives a subset which is the same as the orginal out1
. I want to get only the case when all four values are the same for that row. This way I'll get a subset of the data for the pairwise dissimilarities I'm interested in.
Any ideas on how to subset by row based on a unique combination of four variables would be greatly appreciated.
Upvotes: 1
Views: 2780
Reputation: 118799
I think this is what you're looking for. Basically you want duplicated
function that returns what you're expecting.
out1[duplicated(rbind(out2, out1)[, 1:4])[-seq_len(nrow(out2))], ]
How does it work? First we rbind
out2
and out1
. Then call duplicated
on it. The columns that are in out2
and in out1
will be marked as TRUE
in out1
. This is because the first occurrence is on out2
and it was not duplicated there. But the second time it finds the entry, it will be in out1
and so it'll know there has been a row exactly like this before. So, it'll mark it as duplicated. We now have all duplicated entries. From this we subset only the elements of out1
by removing the first n
elements where n = nrow(out1)
. Then we subset using this logical vector on out1
.
You can go through this explanation and run the code step by step to follow-up. Here's a break-down version for working out the logic.
tt <- rbind(out2, out1)
tt.dup <- duplicated(tt[, 1:4)] # marks all duplicate rows in out1 from 1st 4 cols
tt.dup <- tt.dup[-seq_len(nrow(out2))] # remove all out2 entries (first n)
out1[tt.dup, ] # index only TRUE/duplicated elements from out1
Upvotes: 2