vitor
vitor

Reputation: 1250

Filter out rows from one data.frame that are present in another data.frame

Suppose I have a larger data.frame and a smaller one. If the smaller one is contained inside the larger one, how can I subtract the rows of the smaller data.frame, leaving a result with the difference:

Larger - Smaller

Example:

Small data.frame:

     ID       CSF1PO CSF1PO.1 D10S1248 D10S1248.1 D12S391 D12S391.1
203079_BA_M     10       11       14         16      -9        -9
203079_BA_F      8       12       14         17      -9        -9
203080_BA_M     10       12       13         13      -9        -9

Big data.frame:

      ID      CSF1PO CSF1PO.1 D10S1248 D10S1248.1 D12S391 D12S391.1
203078_MG_M     -9       -9       15         15      18        20
203078_MG_F     -9       -9       14         15      17        19
203079_BA_M     10       11       14         16      -9        -9
203079_BA_F      8       12       14         17      -9        -9
203080_BA_M     10       12       13         13      -9        -9
203080_BA_F     10       11       14         16      -9        -9
203081_MG_M     10       12       14         16      -9        -9
203081_MG_F     11       12       15         16      -9        -9
203082_MG_M     11       11       13         15      -9        -9
203082_MG_F     11       11       13         14      -9        -9

The small data.frame corresponds to the rows 3, 4 and 5 of the larger data.frame.

Upvotes: 25

Views: 40846

Answers (3)

KMuy
KMuy

Reputation: 31

setdiff is fine when no. of columns and types match, but a problem when small dataframe has a subset of columns from the big dataframe .

Alternative is anti-join in dplyr, which gives you all rows in the big dataframe that are not in the small dataframe. It keeps the columns in the big dataframe which is what you need, not combining with the small dataframe columns like other joins do. See link http://rpubs.com/williamsurles/293454

You should change ID (if its a column name) to character else R will coerce to character by default and give you a warning, but having given you a correct result. I got the same answer as setdiff() using this:

small_df$ID <- as.character(small_df$ID)
big_df$ID   <- as.character(big_df$ID)
result      <- anti_join(big_df,small_df)

Result =

ID         CSF1P0 CSF1P0.1 D10S1248 D10S1248.1 D12S391 D12S391.1
203078_MG_M    -9   -9      15       15         18       20 
203078_MG_F    -9   -9      14       15         17       19
203080_BA_F    10   11      14       16         -9       -9  
203081_MG_M    10   12      14       16         -9       -9  
203081_MG_F    11   12      15       16         -9       -9 
203082_MG_M    11   11      13       15         -9       -9  
203082_MG_F    11   11      13       14         -9       -9

Upvotes: 1

leerssej
leerssej

Reputation: 14958

In dplyr:

library(dplyr)

setdiff(BigDF, SmallDF)

More Info: Hadley's dply cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Concise Set Operations functions with examples http://rpackages.ianhowson.com/cran/dplyr/man/setops.html (But the entire Grammar of Data Manipulation is a great resource overall)

And although the below is not in direct answer to your question - it is frequently related for me (and has been very useful)

If you wish to capture the new changes that have occured between a new dataframe and a previous version of the same dataframe (inside the same records) you will want to make your code look as below:

setdiff(NewDF, OldDF)

Upvotes: 18

Ferdinand.kraft
Ferdinand.kraft

Reputation: 12819

Try this:

BigDF[ !(BigDF$ID %in% SmallDF$ID), ]

Upvotes: 21

Related Questions