Reputation: 952
I'm trying to solve a tricky R problem that I haven't been able to solve via Googling keywords. Specifically, I'm trying to take a subset one data frame whose values don't appear in another. Here is an example:
> test
number fruit ID1 ID2
item1 "number1" "apples" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples" "12" "13"
> test2
number fruit ID1 ID2
item1 "number1" "papayas" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples" "123" "13"
item5 "number3" "peaches" "44" "25"
item6 "number4" "apples" "12" "13"
item7 "number1" "apples" "22" "33"
I have two data frames, test and test2, and the goal is to select all entire rows in test2 that don't appear in test, even though some of the values may be the same.
The output I want would look like:
item1 "number1" "papayas" "22" "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples" "123" "13"
There may be an arbitrary amount of rows or columns, but in my specific case, one data frame is a direct subset of the other.
I've used the R subset(), merge() and which() functions extensively, but couldn't figure out how to use these in combination, if it's possible at all, to get what I want.
edit: Here is the R code I used to generate these two tables.
test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))
test <- t(test)
rownames(test) = c("item1", "item2", "item3", "item4")
colnames(test) = c("number", "fruit", "ID1", "ID2")
test2 <- data.frame(data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13) ))
test2 <- t(test2)
rownames(test2) = c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) = c("number", "fruit", "ID1", "ID2")
Thanks in advance!
Upvotes: 19
Views: 28186
Reputation: 18632
library(vctrs)
vec_set_difference(test2, test)
You can also use vec_set_symmetric_difference()
which is commutative.
Upvotes: 1
Reputation: 31
Using the package dplyr, you can also use anti_join.
missing.species <- anti_join(test2, test, by = NULL)
It will return rows of test2 that do not have a match in test. By explicit the variables to join by. If NULL, the function will use all variables in common across test and test2.
Upvotes: 3
Reputation: 898
There are two ways to solve this, using data.table and sqldf
library(data.table)
test<- fread('
item number fruit ID1 ID2
item1 "number1" "apples" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples" "12" "13"
')
test2<- fread('
item number fruit ID1 ID2
item1 "number1" "papayas" "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples" "123" "13"
item5 "number3" "peaches" "44" "25"
item6 "number4" "apples" "12" "13"
item7 "number1" "apples" "22" "33"
')
data.table approach, this enables you to select which columns you want to compare
setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
item number fruit ID1 ID2
1: item1 number1 apples 22 33
2: item3 number3 peaches 44 25
3: item4 number4 apples 12 13
Sql approach
sqldf('select * from test except select * from test2')
item number fruit ID1 ID2
1: item1 number1 apples 22 33
2: item3 number3 peaches 44 25
3: item4 number4 apples 12 13
Upvotes: 4
Reputation: 193517
Here's another approach, but I'm not sure how well it would scale.
test2[!apply(test2, 1, paste, collapse = "") %in%
apply(test, 1, paste, collapse = ""), ]
# number fruit ID1 ID2
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples" "123" "13"
This would not delete all duplicates. Compare, for example, if test2
had duplicates:
test2 <- rbind(test2, test2[1:3, ])
## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
# number fruit ID1 ID2
# item4 "number4" "apples" "123" "13"
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
apply(test, 1, paste, collapse = ""), ]
# number fruit ID1 ID2
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples" "123" "13"
# item1 "number1" "papayas" "22" "33"
# item3 "number3" "peaches" "441" "25"
Upvotes: 1
Reputation: 55350
The following should get you there:
rows <- unique(unlist(mapply(function(x, y)
sapply(setdiff(x, y), function(d) which(x==d)), test2, test1)))
test2[rows, ]
What's happening here is:
mapply
is used to do a column-wise comparison between the two datasets. setdiff
to find any item which are in the former but not the latter which
identifies which row of the former is not present. unique(unlist(....))
grabs all unique rows
Then we use that as a filter to the former, ie test2
number fruit ID1 ID2
item1 number1 papayas 22 33
item3 number3 peaches 441 25
item4 number4 apples 123 13
Make sure that your test
& test2
are data.frames
and not matrices
, since mapply
iterates over each element of a matrix, but over each column of a data.frame
test <- as.data.frame(test, stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)
Upvotes: 3
Reputation: 44614
Here's another way:
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
# number fruit ID1 ID2
# item1 number1 papayas 22 33
# item3 number3 peaches 441 25
# item4 number4 apples 123 13
Edit: modified to preserve row names.
Upvotes: 17
Reputation: 57686
Make a new row-ID column in test2, merge the data frames, and select those rows whose IDs aren't in the merged result.
test2 <- cbind(test2, id=seq_len(nrow(test2)))
matches <- merge(test1, test2)$id
test2 <- test2[-matches, ]
Upvotes: 1