Reputation: 427
I have a data set called df1. It has an ID column and some other columns, for example Date (POSIXt), Price, Sentiment (both numeric), etc.
I have two subsets of df1, which are df2 and df3 (there might be some overlap). I want to remove all the rows of df2 and df3 from df1, i.e. df1 - (df2 U df3), where U is the union.
I have tried subset(), but it is not easy to write the row condition, since it is not a direct comparison like ID != 100.
Of course, a loop would solve this problem, but it takes too much time and looks really ugly.
Is there a vectorized or matrix-style operation that can do this quickly and concisely?
Upvotes: 1
Views: 781
Reputation: 3689
You can use boolean indexing instead.
data = data.frame(id=1:20,value=rnorm(20))
data.1 = data[sample(nrow(data), 5), ]
data.2 = data[sample(nrow(data), 5), ]
The point is to keep only the ids that are NOT (the ! operator) in either subset's ids. The pipe character | is the OR operator: if an id is in either of the two subsets, we eliminate that row.
data[!(data$id %in% data.1$id | data$id %in% data.2$id),]
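For a reproducible check, here is a small sketch with fixed ids instead of random samples (the names drop.1, drop.2, keep.or and keep.c are made up for illustration). It suggests that the OR of two %in% tests selects the same rows as a single %in% test against the combined ids:

```r
# Fixed example data, so the result does not depend on sampling
data <- data.frame(id = 1:10, value = (1:10) * 10)
drop.1 <- data[c(2, 5, 7), ]   # first subset (hypothetical name)
drop.2 <- data[c(2, 8, 9), ]   # second subset, overlaps on id 2

# OR of two membership tests, then negate
keep.or <- data[!(data$id %in% drop.1$id | data$id %in% drop.2$id), ]

# Single membership test against the combined ids
keep.c <- data[!(data$id %in% c(drop.1$id, drop.2$id)), ]

identical(keep.or, keep.c)  # TRUE
keep.or$id                  # 1 3 4 6 10
```

The overlapping id (2 here) needs no special handling, since %in% only asks whether a value appears at least once.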
Upvotes: 1
Reputation: 69201
You can use the [ function to index directly into your df1 object instead of using subset(). We just need to create a logical vector that has the criteria we want. For that, we'll use the %in% function and some negation. This seems to do the trick:
df1 <- data.frame(id = 1:10, foo = rnorm(10), bar = runif(10))
#Randomly sample three rows to create df2 and df3
set.seed(2)
df2 <- df1[sample(1:10, 3, FALSE), ]
df3 <- df1[sample(1:10, 3, FALSE), ]
#what IDs are in df2 and df3?
c(df2$id, df3$id)
#--
[1] 2 7 5 2 9 8
#OK, so we want to get id's 1,3,4,6,10
df1[!(df1$id %in% c(df2$id, df3$id)),]
#--
   id        foo       bar
1   1 -0.5656801 0.8613120
3   3  0.1252706 0.5147147
4   4  1.3532248 0.8224739
6   6  0.3225545 0.9746704
10 10  2.1502097 0.9939075
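If you would rather stay with subset(), the same condition can be passed as its second argument. A minimal sketch, assuming the key column is called ID as in the question (the toy values and the name res are invented for illustration):

```r
# Toy data using the question's column names; values are made up
df1 <- data.frame(ID = 1:6, Price = c(10, 11, 12, 13, 14, 15))
df2 <- df1[c(1, 2), ]
df3 <- df1[c(2, 5), ]   # overlaps with df2 on ID 2

# subset() evaluates the condition inside df1, so ID refers to df1$ID
res <- subset(df1, !(ID %in% c(df2$ID, df3$ID)))
res$ID  # 3 4 6
```

Both forms match on ID alone, so they assume ID identifies a row; if several rows of df1 share an ID, all of them would be dropped.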
Upvotes: 1