hejseb

Reputation: 2164

Condition for subsetting data in R using %in%

I'm subsetting a data frame with subset(), but I'm having trouble using the %in% operator in my logical condition.

Consider a simple data structure like

  x11  x21 x12  x22
1  19 2000  32 2004
2  19 2000  20 2001

I want a subset where it is true that x12 is either equal to x22-x21+x11 or equal to x22-x21+x11+1.* For the example above, I want the second row, since the logical statement should evaluate to TRUE if and only if x12 is 20 or 21, which is satisfied. For this simple setup, the following works for me:

> test1 <- data.frame(x11=c(19, 19), x21=c(2000, 2000), x12=c(32, 20), x22=c(2004, 2001))
> subset(test1, (x12 %in% c(x22-x21+x11, x22-x21+x11+1)))
  x11  x21 x12  x22
2  19 2000  20 2001

The problem arises when I introduce additional rows. Adding just one row:

  x11  x21 x12  x22
1  19 2000  32 2004
2  19 2000  20 2001
3  30 1998  32 2000

Now, I would like to subset this so that I get rows two and three. But using the same subset strategy as above:

> test2 <- data.frame(x11=c(19, 19, 30), x21=c(2000, 2000, 1998), x12=c(32, 20, 32), x22=c(2004, 2001, 2000))
> subset(test2, (x12 %in% c(x22-x21+x11, x22-x21+x11+1)))
  x11  x21 x12  x22
1  19 2000  32 2004
2  19 2000  20 2001
3  30 1998  32 2000

So now I also get row one, which was excluded in the first example. My guess is that the problem lies in the vector that x12 is tested against, i.e. c(x22-x21+x11, x22-x21+x11+1), but I'm not sure how to construct it so that it is evaluated "row-wise" rather than as one vector for all rows.

Ideas are much appreciated!


*x11 is the age of an individual at time point x21, and x12 is the age of a (possibly different) individual at time point x22. I want the subset containing the rows in which the age (x11) at x21 is logically and physically compatible with the age (x12) at x22; an individual who is 19 in 2000 is either 19, 20 or 21 in 2001 depending on birthdays (but I discard the possibility of the individual being the same age here, for other reasons). Thus, the first row, in which we have age 19 in the year 2000, and age 32 at 2004, is not possible for the same individual.

Upvotes: 1

Views: 74

Answers (2)

Carl Witthoft

Reputation: 21502

First of all, beware of floating-point precision limits. If your values are all integers, this doesn't matter, but in the general case x == y can fail unless you use a tolerance-aware tool such as all.equal.
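As a small illustration of the floating-point point (the variable names a, b, exact and tolerant are made up for this sketch and are not part of the question's data), exact equality can fail where a tolerance-based comparison succeeds:

```r
# Two values that are mathematically equal but differ in binary floating point
a <- 0.1 + 0.2
b <- 0.3

exact    <- a == b                    # FALSE: exact comparison of doubles
tolerant <- isTRUE(all.equal(a, b))   # TRUE: equal within all.equal's default tolerance

print(c(exact = exact, tolerant = tolerant))
```

Wrapping all.equal in isTRUE matters, because all.equal returns a character description rather than FALSE when the values differ.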
Now, rather than mucking with subset or %in%, just write a conditional:

# Columns by position: 1 = x11, 2 = x21, 3 = x12, 4 = x22
foo <- test1[(test1[,3]==(test1[,1]-test1[,2]+test1[,4])) |
               (test1[,3]==(test1[,1]-test1[,2]+test1[,4]+1)), ]

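Since == and | are vectorized, this comparison is already evaluated element by element, i.e. per row. You would only need apply on a row-by-row basis for a condition that cannot be written with vectorized operators.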
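To make the difference concrete, here is a minimal sketch on the three-row test2 data from the question. The helper names pool, pooled_match, expected and rowwise_match are only illustrative. It shows that %in% tests each x12 against the pooled candidate ages of all rows, whereas == compares each x12 with the value computed for its own row:

```r
test2 <- data.frame(x11 = c(19, 19, 30),
                    x21 = c(2000, 2000, 1998),
                    x12 = c(32, 20, 32),
                    x22 = c(2004, 2001, 2000))

# %in% checks membership in ONE vector built from all rows:
# pool is 23 20 32 24 21 33
pool <- with(test2, c(x22 - x21 + x11, x22 - x21 + x11 + 1))
pooled_match <- test2$x12 %in% pool
# TRUE TRUE TRUE (row 1's 32 matches the 32 contributed by row 3)

# == is element-wise, so each row is compared with its own expected age
expected <- with(test2, x22 - x21 + x11)   # 23 20 32
rowwise_match <- test2$x12 == expected | test2$x12 == expected + 1
# FALSE TRUE TRUE

test2[rowwise_match, ]   # keeps rows 2 and 3 only
```

This is why the two-row example appeared to work: there, the pooled vector happened not to contain 32.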

Upvotes: 2

zx8754

Reputation: 56159

Try this:

#data
test2 <- data.frame(x11=c(19, 19, 30),
                    x21=c(2000, 2000, 1998),
                    x12=c(32, 20, 32),
                    x22=c(2004, 2001, 2000))
#range pre-computed
test2$in1 <- test2$x22-test2$x21+test2$x11
test2$in2 <- test2$x22-test2$x21+test2$x11+1

#subset
test2[ test2$x12 >= test2$in1 &
         test2$x12 <= test2$in2,]
#   x11  x21 x12  x22 in1 in2
# 2  19 2000  20 2001  20  21
# 3  30 1998  32 2000  32  33
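If you would rather keep subset() and %in% as in the question, one possible variant (a sketch, assuming the ages and years are whole numbers; res is just an illustrative name) is to compute the row-wise difference first and test that against the fixed set c(0, 1). This avoids adding helper columns:

```r
test2 <- data.frame(x11 = c(19, 19, 30),
                    x21 = c(2000, 2000, 1998),
                    x12 = c(32, 20, 32),
                    x22 = c(2004, 2001, 2000))

# The difference is computed per row; only the set c(0, 1) is shared by all rows.
# The outer parentheses are needed because %in% binds tighter than binary minus.
res <- subset(test2, (x12 - (x22 - x21 + x11)) %in% c(0, 1))

print(res)   # rows 2 and 3
```

Here %in% is safe because the right-hand side no longer depends on the rows being tested.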

Upvotes: 1
