R Subset a data frame with a complex condition

I have a data set called df1. It has an ID column and some other columns, for example Date (POSIXt), Price and Sentiment (both numeric), etc.

I have two subsets of df1, df2 and df3, which may overlap. I want to remove every row that appears in df2 or df3 from df1, that is, df1 - (df2 U df3), where U is the union.

I have tried subset(), but it is not easy to write the row condition, since it is not a direct condition like ID != 100.

Of course, a loop would solve this problem, but it takes too much time and looks really ugly.
Is there a vectorized way, such as a vector or matrix operation, to do this quickly and concisely?

Upvotes: 1

Answers (2)

ako

Reputation: 3689

You can use boolean indexing instead.

1. generate data

data = data.frame(id=1:20,value=rnorm(20))

2. make two subsets, each 5 rows

data.1 = data[sample(nrow(data), 5), ]
data.2 = data[sample(nrow(data), 5), ]

3. index rows

The point is to keep only the rows whose id is NOT (the ! operator) in either subset's ids. The pipe character | is the element-wise OR operator: if an id appears in either of the two subsets, that row is dropped.

data[!(data$id %in% data.1$id | data$id %in% data.2$id),]
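The logic of step 3 can be checked on a small fixed example. This is only a sketch: the ids and values below are made up for illustration, and kept, kept_alt and in_either are hypothetical names. It assumes id uniquely identifies each row.

```r
# Small fixed example (ids and values chosen only for illustration)
data <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))
data.1 <- data[c(1, 2), ]   # ids 1, 2
data.2 <- data[c(2, 5), ]   # ids 2, 5 (overlaps data.1 on id 2)

# TRUE for every row whose id occurs in at least one subset
in_either <- data$id %in% data.1$id | data$id %in% data.2$id

# Negate to keep the rows that occur in neither subset
kept <- data[!in_either, ]
kept$id  # 3 4 6

# De Morgan: NOT (A OR B) is the same as (NOT A) AND (NOT B)
kept_alt <- data[!(data$id %in% data.1$id) & !(data$id %in% data.2$id), ]
identical(kept, kept_alt)  # TRUE
```

The overlap on id 2 causes no trouble, because %in% only asks whether an id occurs at all, not how many times.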

Upvotes: 1

Chase

Reputation: 69201

You can use the [ function to index directly into your df1 object instead of using subset(). We just need to create a logical vector that has the criteria we want. For that, we'll use the %in% function and some negation. This seems to do the trick:

df1 <- data.frame(id = 1:10, foo = rnorm(10), bar = runif(10))

#Randomly sample three rows to create df2 and df3
set.seed(2)
df2 <- df1[sample(1:10, 3, FALSE), ]
df3 <- df1[sample(1:10, 3, FALSE), ]

#what IDs are in df2 and df3?
c(df2$id, df3$id)
#--
[1] 2 7 5 2 9 8

#OK, so we want to get ids 1, 3, 4, 6, 10
df1[!(df1$id %in% c(df2$id, df3$id)),]
#--
   id        foo       bar
1   1 -0.5656801 0.8613120
3   3  0.1252706 0.5147147
4   4  1.3532248 0.8224739
6   6  0.3225545 0.9746704
10 10  2.1502097 0.9939075
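Since the question started from subset(), it may help to see that the same logical condition can be passed there as well. A minimal sketch, assuming id uniquely identifies rows; the subsets are built from fixed row positions (an assumption made here so the result does not depend on the random number generator), and drop_ids, res_subset and res_bracket are illustrative names:

```r
df1 <- data.frame(id = 1:10, foo = rnorm(10), bar = runif(10))

# Fixed rows instead of sample(), so the ids to drop are known in advance
df2 <- df1[c(2, 7, 5), ]
df3 <- df1[c(2, 9, 8), ]

# All ids that appear in either subset (duplicates are harmless for %in%)
drop_ids <- c(df2$id, df3$id)

# subset() evaluates the condition inside df1, so `id` refers to df1$id
res_subset <- subset(df1, !(id %in% drop_ids))

# Equivalent bracket form
res_bracket <- df1[!(df1$id %in% drop_ids), ]

res_subset$id  # 1 3 4 6 10
all.equal(res_subset, res_bracket)
```

The bracket form is usually the safer choice inside functions, because subset() relies on non-standard evaluation of its condition.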

Upvotes: 1
