Reputation: 1848
I have two data frames(df and df1). df1 is subset of df. I want to get a data frame which is complement of df1 in df, i.e. return rows of the first data set which are not matched in the second. For example let,
data frame df:
heads
row1
row2
row3
row4
row5
data frame df1:
heads
row3
row5
Then the desired output df2 is:
heads
row1
row2
row4
Upvotes: 50
Views: 41191
Reputation: 521467
Late answer, but for another option we can try doing a formal SQL anti join, using the sqldf
package:
library(sqldf)
sql <- "SELECT t1.heads
FROM df t1 LEFT JOIN df1 t2
ON t1.heads = t2.heads
WHERE t2.heads IS NULL"
df2 <- sqldf(sql)
The sqldf
package can be useful for those problems which are easily phrased using SQL logic, but perhaps less easily phrased using base R or another R package.
Upvotes: 3
Reputation: 92292
You could also do some type of anti join with data.table
s binary join
library(data.table)
setkey(setDT(df), heads)[!df1]
# heads
# 1: row1
# 2: row2
# 3: row4
EDIT: Starting data.table v1.9.6+ we can join data.tables without setting keys while using on
setDT(df)[!df1, on = "heads"]
EDIT2: Starting data.table v1.9.8+ fsetdiff
was introduced which is basically a variation of the solution above, just over all the column names of the x
data.table, e.g. x[!y, on = names(x)]
. If all
set to FALSE
(the default behavior), then only unique rows in x
will be returned. For the case of only one column in each data.table the following will be equivalent to the previous solutions
fsetdiff(df, df1, all = TRUE)
Upvotes: 80
Reputation: 14968
dplyr also has setdiff()
which will get you the
setdiff(bigFrame, smallFrame)
gets you the extra records in the first table.
so for the OP's example the code would read setdiff(df, df1)
dplyr has a lot of great functionality: for a quick easy guide see here.
Upvotes: 5
Reputation: 3587
Another option by creating a function negate_match_df
by manipulating the code of match_df
of plyr
package.
library(plyr)
negate_match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[!keys$x %in% keys$y, , drop = FALSE]
}
Data
df <- read.table(text ="heads
row1
row2
row3
row4
row5",header=TRUE)
df1 <- read.table(text ="heads
row3
row5",header=TRUE)
Output
negate_match_df(df,df1)
Upvotes: 2
Reputation: 3991
Another option, using base R and the setdiff
function:
df2 <- data.frame(heads = setdiff(df$heads, df1$heads))
setdiff
functions exactly as you would imagine; take both arguments as sets, and remove all items in the second from the first.
I find setdiff
more readable tahtn %in%
and prefer not to require additional libraries when I don't need them, but which answer you use is largely a question of personal taste.
Upvotes: 8
Reputation: 7796
Try the %in%
command and reverse it with !
df[!df$heads %in% df1$heads,]
Upvotes: 28
Reputation: 887223
Try anti_join
from dplyr
library(dplyr)
anti_join(df, df1, by='heads')
Upvotes: 44