Panagiotis Goulas

Reputation: 41

What exactly does the %in% operator compare when it is applied to two data frames?

So I have

df=data.frame(age=c(10,12,12,13,13,10), name=c('Maria','anders','anders','per','johanna','Maria'))

dups=df[duplicated(df),] 

What does R do when I run df %in% dups?

Output: FALSE FALSE

I do realise that if I run, for example, df$name %in% dups$name, I get

Output: TRUE TRUE TRUE FALSE FALSE TRUE

which compares every name in df with the names in dups and checks whether it occurs at least once in dups. I would have assumed that df %in% dups checks every row of df against every row of dups, but that doesn't seem to be the case.

Upvotes: 2

Views: 51

Answers (1)

Sven Hohenstein

Reputation: 81683

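When %in% is applied to data frames, the comparison takes place column-wise. A data frame is a list of columns, so each column of the left-hand data frame is matched as a whole against the columns of the right-hand one, and the result has one value per column rather than one per row.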

For example

df %in% df["age"]
# [1]  TRUE FALSE

compares each column in df with the single column of the one-column data frame df["age"]. Since the age column is identical in both data frames, the first value is TRUE. The name column has no counterpart in df["age"], so the second value is FALSE.
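To see the column-wise behaviour on the data from the question, here is a small sketch. The variable names res_cols and res_age are only illustrative and do not appear in the original post.

```r
df <- data.frame(age  = c(10, 12, 12, 13, 13, 10),
                 name = c("Maria", "anders", "anders", "per", "johanna", "Maria"))
dups <- df[duplicated(df), ]

# %in% returns one value per column of df, not one per row
res_cols <- df %in% dups
length(res_cols) == ncol(df)
# [1] TRUE

# A column counts as a match only if the whole column occurs in the other data frame
res_age <- df %in% df["age"]
res_age
# [1]  TRUE FALSE
```

This is why df %in% dups gives FALSE FALSE: neither the full age column nor the full name column of df occurs as a column of dups.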


For a row-wise comparison, you can use the following (complex) command:

# For each row of df, check whether all of its values equal those of at least one row of dups
sapply(seq_len(nrow(df)),
       function(i1) any(sapply(seq_len(nrow(dups)),
                               function(i2) all(df[i1, ] == dups[i2, ]))))
# [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE
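A shorter alternative, which is not part of the original answer, is to build one key string per row and apply %in% to those keys. This is only a sketch: it assumes both data frames have the same columns in the same order and that the separator ("\r" here, an arbitrary choice) never occurs in the data. The helper name row_key is hypothetical.

```r
df <- data.frame(age  = c(10, 12, 12, 13, 13, 10),
                 name = c("Maria", "anders", "anders", "per", "johanna", "Maria"))
dups <- df[duplicated(df), ]

# Hypothetical helper: paste the values of each row into a single string
row_key <- function(d) do.call(paste, c(d, sep = "\r"))

# Compare the row keys of df with the row keys of dups
in_dups <- row_key(df) %in% row_key(dups)
in_dups
# [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE
```

Because the comparison happens on strings, numeric columns are compared by their printed representation, so this approach suits exact, simple values like the ones in this example.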

Upvotes: 4
