Reputation: 2164
I'm doing some subsetting using subset() but having some issues with using the %in% operator in my logical statement.
Consider a simple data structure like
x11 x21 x12 x22
1 19 2000 32 2004
2 19 2000 20 2001
I want a subset where it is true that x12 is either equal to x22-x21+x11 or equal to x22-x21+x11+1.* For the example above, I want the second row, since the logical statement should evaluate to TRUE if and only if x12 is 20 or 21, which is satisfied. For this simple setup, the following works for me:
> test1 <- data.frame(x11=c(19, 19), x21=c(2000, 2000), x12=c(32, 20), x22=c(2004, 2001))
> subset(test1, (x12 %in% c(x22-x21+x11, x22-x21+x11+1)))
x11 x21 x12 x22
2 19 2000 20 2001
The problem arises when I introduce additional rows. Adding just one row:
x11 x21 x12 x22
1 19 2000 32 2004
2 19 2000 20 2001
3 30 1998 32 2000
Now, I would like to subset this so that I get rows two and three. But using the same subset strategy as above:
> test2 <- data.frame(x11=c(19, 19, 30), x21=c(2000, 2000, 1998), x12=c(32, 20, 32), x22=c(2004, 2001, 2000))
> subset(test2, (x12 %in% c(x22-x21+x11, x22-x21+x11+1)))
x11 x21 x12 x22
1 19 2000 32 2004
2 19 2000 20 2001
3 30 1998 32 2000
So now I also get the first row, which I did not get in the first example. My guess is that it is related to the vector which x12 is allowed to be in, i.e. c(x22-x21+x11, x22-x21+x11+1), but I'm not sure how to construct this so that it is implied to be "row-wise" and not one vector for all rows.
Ideas are much appreciated!
*x11 is the age of an individual at time point x21, and x12 is the age of a (possibly different) individual at time point x22. I want the subset containing the rows in which the age (x11) at x21 is logically and physically compatible with the age (x12) at x22; an individual who is 19 in 2000 is either 19, 20 or 21 in 2001 depending on birthdays (but I discard the possibility of the individual being the same age here, for other reasons). Thus, the first row, in which we have age 19 in the year 2000, and age 32 at 2004, is not possible for the same individual.
Upvotes: 1
Views: 74
Reputation: 21502
First of all, beware of floating point precision limits. If your values are all integers, this doesn't matter, but in the general case x==y can fail unless you use tools like all.equal.
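To illustrate the floating point caveat, here is a minimal sketch. The helper is_close and its tolerance of 1e-8 are assumptions made up for this example, not part of base R or of the question's data.

```r
# Exact comparison can fail for non-integer values.
a <- 0.1 + 0.2
b <- 0.3
a == b                     # FALSE
isTRUE(all.equal(a, b))    # TRUE

# all.equal() compares whole objects, not element by element, so for
# filtering rows a vectorized tolerance check is more convenient.
# is_close is a hypothetical helper; the tolerance is an arbitrary choice.
is_close <- function(x, y, tol = 1e-8) abs(x - y) < tol
is_close(c(0.1 + 0.2, 1), c(0.3, 2))   # TRUE FALSE
```

With integer-valued ages and years, as in the question, plain == is fine.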
Now, rather than mucking with subset or %in%, just write a conditional:
foo <- test1[(test1[,3]==(test1[,1]-test1[,2]+test1[,4])) |
(test1[,3]==(test1[,1]-test1[,2]+test1[,4]+1)), ]
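These comparisons are vectorized, so each row is checked against its own values. For conditions that cannot be vectorized, you may need to run apply on a row-by-row basis.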
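As a sketch of why the %in% version in the question misbehaves, using the question's test2 data: c(...) pools the candidate ages from every row into one vector, so a row passes if its x12 matches any row's candidate. The names pooled and res are only for illustration.

```r
test2 <- data.frame(x11 = c(19, 19, 30),
                    x21 = c(2000, 2000, 1998),
                    x12 = c(32, 20, 32),
                    x22 = c(2004, 2001, 2000))

# The right-hand side of %in% is one vector built from all rows.
pooled <- with(test2, c(x22 - x21 + x11, x22 - x21 + x11 + 1))
pooled                # 23 20 32 24 21 33
test2$x12 %in% pooled # TRUE TRUE TRUE (row 1's 32 matches row 3's candidate)

# Element-wise comparison with == and | checks each row against its own values.
res <- subset(test2, x12 == x22 - x21 + x11 | x12 == x22 - x21 + x11 + 1)
res                   # rows 2 and 3 only
```

The same condition works with subset() directly, so keeping subset() is possible as long as %in% is replaced by element-wise comparisons.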
Upvotes: 2
Reputation: 56159
Try this:
#data
test2 <- data.frame(x11=c(19, 19, 30),
x21=c(2000, 2000, 1998),
x12=c(32, 20, 32),
x22=c(2004, 2001, 2000))
#range pre-computed
test2$in1 <- test2$x22-test2$x21+test2$x11
test2$in2 <- test2$x22-test2$x21+test2$x11+1
#subset
test2[ test2$x12 >= test2$in1 &
test2$x12 <= test2$in2,]
# x11 x21 x12 x22 in1 in2
# 2 19 2000 20 2001 20 21
# 3 30 1998 32 2000 32 33
Upvotes: 1