msmf14

Reputation: 1499

Getting Number of Elements of Dataframe equal to elements in List - R

I have a dataframe with the following structure:

DataFrame$Fruit
Apple
Banana
Apple
Mango
Banana

etc

I would like to work with the rows (for instance count them, but also do other manipulations) whose Fruit value appears in a given vector:

keepFruit = c('Apple', 'Banana')

The expression DataFrame$Fruit[which(DataFrame$Fruit == 'Apple' | DataFrame$Fruit == 'Banana')] returns the correct number of elements. However, the following expression returns the wrong number: DataFrame$Fruit[which(DataFrame$Fruit == keepFruit)]

I understand that the second expression compares an array to a single text entry. But why does it still return partly correct results (elements where the fruits match up), just not all of them? And what is a better way to get the data for all fruits that are members of the keepFruit array?

Upvotes: 2

Views: 248

Answers (2)

akrun

Reputation: 886948

You can also use Vectorize():

Vectorize(function(x) df$Fruit==x)(keepFruit) #which gives the logical index of matching positions
#     Apple Banana
#[1,]  TRUE  FALSE
#[2,] FALSE   TRUE
#[3,]  TRUE  FALSE
#[4,] FALSE  FALSE
#[5,] FALSE   TRUE
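
To turn that logical matrix into a single row index, one option is to keep every row where at least one column is TRUE. This is only a minimal sketch, assuming the same five-row df and keepFruit used in the other answer; the names m and keep are introduced here for illustration.

```r
df <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Mango", "Banana"))
keepFruit <- c("Apple", "Banana")

# One logical column per element of keepFruit
m <- Vectorize(function(x) df$Fruit == x)(keepFruit)

# A row is kept if it matches at least one of the fruits
keep <- rowSums(m) > 0

df[keep, , drop = FALSE]  # rows 1, 2, 3 and 5
sum(keep)                 # number of matching rows
```

For plain membership tests this does more work than %in%, but the intermediate matrix can be handy if you also want to know which fruit each row matched.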

Upvotes: 1

Gavin Simpson

Reputation: 174788

You want %in% for this, not ==. Here's an example:

df <- data.frame(Fruit = c("Apple","Banana","Apple","Mango","Banana"))
keepFruit = c('Apple', 'Banana')

ind <- with(df, Fruit %in% keepFruit)
df[ind, , drop = FALSE]

Giving

> df[ind, , drop = FALSE]
   Fruit
1  Apple
2 Banana
3  Apple
5 Banana
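
Since the question also asks about counting, the same logical index can be summed or tabulated. A small sketch reusing the example data above; n_keep and counts are names made up for this illustration, and as.character() is there only so that unused factor levels do not show up if Fruit happens to be a factor.

```r
df <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Mango", "Banana"))
keepFruit <- c("Apple", "Banana")

ind <- df$Fruit %in% keepFruit

# Total number of rows whose Fruit is in keepFruit
n_keep <- sum(ind)

# Number of rows per kept fruit
counts <- table(as.character(df$Fruit[ind]))

n_keep
counts
```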

As for why == doesn't work, you must appreciate that == does an element-wise comparison of Fruit and keepFruit, i.e. it compares the first elements of each vector for equality, then the second elements of each vector, and so on. Now keepFruit is not of the same length as Fruit, so R recycles the elements of keepFruit to match the length of Fruit, the longer of the two vectors. It is as if R has done

with(df, Fruit == rep(keepFruit, length = length(Fruit)))

or

keepFruit2 <- with(df, rep(keepFruit, length = length(Fruit)))
with(df, Fruit == keepFruit2)
rm(keepFruit2)

Both return the following, which is not what you wanted but is correct as far as R's recycling rules go:

> with(df, Fruit == rep(keepFruit, length = length(Fruit)))
[1]  TRUE  TRUE  TRUE FALSE FALSE

R does a little bit more than this however, because it warns you that the length of the longer vector Fruit is not a multiple of the length of the shorter vector keepFruit - this almost always indicates a problem, hence the warning:

> with(df, Fruit == keepFruit)
[1]  TRUE  TRUE  TRUE FALSE FALSE
Warning messages:
1: In is.na(e1) | is.na(e2) :
  longer object length is not a multiple of shorter object length
2: In `==.default`(Fruit, keepFruit) :
  longer object length is not a multiple of shorter object length
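
Bear in mind that the warning only appears when the lengths do not divide evenly. The following sketch uses a hypothetical four-element vector fruit4 (not part of the question's data) to illustrate how the same mistake can pass without any warning when the longer length is a multiple of the shorter one.

```r
keepFruit <- c("Apple", "Banana")
fruit4 <- c("Banana", "Apple", "Apple", "Banana")  # hypothetical data

# keepFruit is recycled to Apple, Banana, Apple, Banana; no warning is issued
eq <- fruit4 == keepFruit

# %in% tests membership rather than position
isin <- fruit4 %in% keepFruit

eq    # FALSE FALSE  TRUE  TRUE
isin  # TRUE  TRUE  TRUE  TRUE
```

So the absence of a warning does not mean that == did what you intended.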

To see in more detail what happens, we can augment df with the values used in the comparison with == and the result

df2 <- within(df, {
    Keep <- rep(keepFruit, length = length(Fruit))
    Result <- Fruit == Keep
  })
df2

> df2
   Fruit Result   Keep
1  Apple   TRUE  Apple
2 Banana   TRUE Banana
3  Apple   TRUE  Apple
4  Mango  FALSE Banana
5 Banana  FALSE  Apple

Now you should be able to see why you got the result you did with == and why it was wrong.

Upvotes: 4
