Reputation: 952

R selecting all rows from a data frame that don't appear in another

I'm trying to solve a tricky R problem that I haven't been able to solve via Googling keywords. Specifically, I'm trying to take a subset one data frame whose values don't appear in another. Here is an example:

> test
      number    fruit     ID1  ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
> test2
      number    fruit     ID1   ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"

I have two data frames, test and test2, and the goal is to select all entire rows in test2 that don't appear in test, even though some of the values may be the same.

The output I want would look like:

item1 "number1" "papayas" "22"  "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples"  "123" "13"

There may be an arbitrary amount of rows or columns, but in my specific case, one data frame is a direct subset of the other.

I've used the R subset(), merge() and which() functions extensively, but couldn't figure out how to use these in combination, if it's possible at all, to get what I want.

edit: Here is the R code I used to generate these two tables.

test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

test <- t(test)
rownames(test) = c("item1", "item2", "item3", "item4")
colnames(test) = c("number", "fruit", "ID1", "ID2")

test2 <- data.frame(data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13)  ))

test2 <- t(test2)
rownames(test2) = c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) = c("number", "fruit", "ID1", "ID2")

Thanks in advance!

Upvotes: 19

Answers (7)

LMc

Reputation: 18632

library(vctrs)

vec_set_difference(test2, test)

You can also use vec_set_symmetric_difference() which is commutative.

Upvotes: 1

Procyon_lotor

Reputation: 31

Using the package dplyr, you can also use anti_join.

missing.species <- anti_join(test2, test, by = NULL)

It will return rows of test2 that do not have a match in test. By explicit the variables to join by. If NULL, the function will use all variables in common across test and test2.

Upvotes: 3

usct01

Reputation: 898

There are two ways to solve this, using data.table and sqldf

library(data.table)
test<- fread('
item number fruit ID1 ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
')
test2<- fread('
item number fruit ID1 ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"
')

data.table approach, this enables you to select which columns you want to compare

setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

Sql approach

sqldf('select * from test except select * from test2')
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

Upvotes: 4

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's another approach, but I'm not sure how well it would scale.

test2[!apply(test2, 1, paste, collapse = "") %in% 
        apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"

This would not delete all duplicates. Compare, for example, if test2 had duplicates:

test2 <- rbind(test2, test2[1:3, ])

## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#       number    fruit     ID1   ID2 
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
  apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

Upvotes: 1

Ricardo Saporta

Reputation: 55350

The following should get you there:

rows <- unique(unlist(mapply(function(x, y) 
          sapply(setdiff(x, y), function(d) which(x==d)), test2, test1)))
test2[rows, ]

What's happening here is:

mapply is used to do a column-wise comparison between the two datasets.
It uses setdiff to find any item which are in the former but not the latter
which identifies which row of the former is not present.
unique(unlist(....)) grabs all unique rows
Then we use that as a filter to the former, ie test2

Results:

       number   fruit ID1 ID2
item1 number1 papayas  22  33
item3 number3 peaches 441  25
item4 number4  apples 123  13

edit:

Make sure that your test & test2 are data.frames and not matrices, since mapply iterates over each element of a matrix, but over each column of a data.frame

test  <- as.data.frame(test,  stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)

Upvotes: 3

Matthew Plourde

Reputation: 44614

Here's another way:

x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#        number   fruit ID1 ID2
# item1 number1 papayas  22  33
# item3 number3 peaches 441  25
# item4 number4  apples 123  13

Edit: modified to preserve row names.

Upvotes: 17

Hong Ooi

Reputation: 57686

Make a new row-ID column in test2, merge the data frames, and select those rows whose IDs aren't in the merged result.

test2 <- cbind(test2, id=seq_len(nrow(test2)))

matches <- merge(test1, test2)$id

test2 <- test2[-matches, ]

Upvotes: 1

R selecting all rows from a data frame that don&#39;t appear in another

Answers (7)

Results:

edit:

Related Questions

R selecting all rows from a data frame that don't appear in another