Reputation: 1499
I have a dataframe with the following structure:
DataFrame$Fruit
Apple
Banana
Apple
Mango
Banana
etc
I would like to work with the rows (for instance count them, but also do other manipulations) whose Fruit value appears in a given vector:
keepFruit = c('Apple', 'Banana')
The following expression: DataFrame$Fruit[which(DataFrame$Fruit == 'Apple' | DataFrame$Fruit == 'Banana')]
returns the correct number of elements, whereas the following expression gives an erroneous number: DataFrame$Fruit[which(DataFrame$Fruit == keepFruit)]
I understand the second expression is comparing a vector to a single text entry. However, how come it still returns somewhat correct results (it returns elements where the fruits match up), though not the same number (i.e. not all)? And what is a better way to get the data for all fruits that are members of the keepFruit array?
Upvotes: 2
Views: 248
Reputation: 886948
You can also use Vectorize() to compare Fruit against each element of keepFruit in turn:
Vectorize(function(x) df$Fruit==x)(keepFruit) #which gives the logical index of matching positions
#     Apple Banana
#[1,]  TRUE  FALSE
#[2,] FALSE   TRUE
#[3,]  TRUE  FALSE
#[4,] FALSE  FALSE
#[5,] FALSE   TRUE
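The matrix above has one column per fruit, so it still needs to be collapsed into a single logical index before it can be used for subsetting. A minimal sketch, assuming the same df and keepFruit as in the other answer (rowSums() is my choice here, not something the original answer specified):

```r
df <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Mango", "Banana"),
                 stringsAsFactors = FALSE)
keepFruit <- c("Apple", "Banana")

# One logical column per entry of keepFruit, as in the output above
m <- Vectorize(function(x) df$Fruit == x)(keepFruit)

# Keep a row if it matches any of the fruits (at least one TRUE in that row)
ind <- rowSums(m) > 0

df$Fruit[ind]
```

For this particular task Fruit %in% keepFruit gives the same index more directly.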
Upvotes: 1
Reputation: 174788
You want %in% for this, not ==. Here's an example:
df <- data.frame(Fruit = c("Apple","Banana","Apple","Mango","Banana"))
keepFruit = c('Apple', 'Banana')
ind <- with(df, Fruit %in% keepFruit)
df[ind, , drop = FALSE]
Giving
> df[ind, , drop = FALSE]
   Fruit
1  Apple
2 Banana
3  Apple
5 Banana
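Since the question also mentions counting, here is a small sketch of how the same index might be used for that. The names kept, n_kept and counts are illustrative only, and stringsAsFactors = FALSE is set so that table() does not report an empty Mango level on older R versions:

```r
df <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Mango", "Banana"),
                 stringsAsFactors = FALSE)
keepFruit <- c("Apple", "Banana")

# Values of Fruit that are members of keepFruit
kept <- df$Fruit[df$Fruit %in% keepFruit]

n_kept <- length(kept)   # total number of matching rows
counts <- table(kept)    # number of rows per kept fruit

n_kept
counts
```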
As for why == doesn't work, you must appreciate that == does an element-wise comparison of Fruit and keepFruit, i.e. it compares the first elements of each vector for equality, then the second elements of each vector, and so on. Now keepFruit is not of the same length as Fruit, so R recycles the elements of keepFruit to match the length of Fruit, the longer of the two vectors. It is as if R has done
with(df, Fruit == rep(keepFruit, length = length(Fruit)))
or
keepFruit2 <- with(df, rep(keepFruit, length = length(Fruit)))
with(df, Fruit == keepFruit2)
rm(keepFruit2)
Both return the following, which is not what you wanted but is correct as far as R's rules go:
> with(df, Fruit == rep(keepFruit, length = length(Fruit)))
[1] TRUE TRUE TRUE FALSE FALSE
R does a little bit more than this, however, because it warns you that the length of the longer vector Fruit is not a multiple of the length of the shorter vector keepFruit. This almost always indicates a problem, hence the warning:
> with(df, Fruit == keepFruit)
[1] TRUE TRUE TRUE FALSE FALSE
Warning messages:
1: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
2: In `==.default`(Fruit, keepFruit) :
longer object length is not a multiple of shorter object length
To see in more detail what happens, we can augment df with the values used in the comparison with == and the result:
df2 <- within(df, {
    Keep <- rep(keepFruit, length = length(Fruit))
    Result <- Fruit == Keep
})
df2
> df2
   Fruit Result   Keep
1  Apple   TRUE  Apple
2 Banana   TRUE Banana
3  Apple   TRUE  Apple
4  Mango  FALSE Banana
5 Banana  FALSE  Apple
Now you should be able to see why you got the result you did with == and why it was wrong.
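One caveat worth adding: the warning only appears when the longer length is not a multiple of the shorter one. A short sketch with a hypothetical four-element vector (fruit4 is made up for illustration and is not from the question) suggests how == can give the wrong answer without any warning, whereas %in% is unaffected:

```r
# Hypothetical data: length 4 is a multiple of length(keepFruit) == 2
fruit4 <- c("Apple", "Banana", "Banana", "Apple")
keepFruit <- c("Apple", "Banana")

# keepFruit is recycled to Apple, Banana, Apple, Banana; no warning is issued
res_eq <- fruit4 == keepFruit

# Membership test: every element of fruit4 is in keepFruit
res_in <- fruit4 %in% keepFruit

res_eq
res_in
```

Here res_eq is TRUE for only the first two elements even though all four fruits are in keepFruit, so the absence of a warning should not be taken as evidence that == did what was intended.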
Upvotes: 4