SteveS

Reputation: 4040

Why does dplyr filter_all using all_vars > 0 work on a character string?

Given:

df <- structure(list(word = c("aaliyahmaxwell", "abasc", "abbslovesfed", 
"abbycastro", "abc", "abccarpet", "abdul", "ability", "abnormile", 
"abraham"), chardonnay = c(4, 0, 0, 0, 0, 0, 0, 0, 0, 0), coffee = c(0, 
1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("word", "chardonnay", 
"coffee"), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

Why does df %>% filter_all(all_vars(. > 0)) work?

I mean that my first column is of type character and can't be > 0. I can understand why it works on the other two columns but need an explanation on why it works when I have a mixture of character and double type columns.

Please advise.

Upvotes: 1

Views: 654

Answers (2)

DSGym

Reputation: 2867

Even though there is already a good answer, I think this can be made clearer with an example:

> c("a", 0)
[1] "a" "0"

Here you can see what happens: the number gets coerced to a character string.

Character strings are compared lexicographically, in the collation order of the current locale. Example:

> "b" > "a" 
[1] TRUE

> "a" > "5"
[1] TRUE

> charvector <- sample(c(seq(1,9), LETTERS))
> charvector
 [1] "6" "D" "T" "U" "I" "R" "F" "S" "J" "W" "B" "A" "8" "E" "2" "7" "O" "Z" "V" "G" "9" "4" "H" "C" "Y" "1" "X" "5" "M" "K" "Q" "L" "N" "3" "P"

The order also becomes clear when you sort that vector:

> sort(charvector)
 [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
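One consequence worth spelling out, as a minimal sketch: because the comparison is done on strings, numbers stored as text do not compare by magnitude. The vector `x` below is made up for illustration and is not part of the question's data.

```r
# Hypothetical vector of numbers stored as text
x <- c("9", "10", "100")

# String comparison goes character by character, so "10" is not greater than "9"
as_strings <- x > "9"

# Converting to numeric first gives the comparison by magnitude
as_numbers <- as.numeric(x) > 9

print(as_strings)  # FALSE FALSE FALSE
print(as_numbers)  # FALSE  TRUE  TRUE
```

So if a character column really holds numbers, convert it with as.numeric() before filtering on it.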

Upvotes: 2

akrun

Reputation: 887501

This is due to type coercion. Here, the numeric value 0 gets converted to the character string "0". According to `?Comparison`:

If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.

df %>%
   filter(word > 0)

This returns all the rows of the original data because

letters > 0
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[26] TRUE

Every value in the 'word' column is a character string that compares greater than "0" after coercion, so all_vars effectively only checks whether the numeric columns are greater than 0.
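The coercion can be checked directly in base R. This is a small sketch using a hypothetical three-word vector `word`, not the full column from the question:

```r
# A few words in the style of the 'word' column
word <- c("aaliyahmaxwell", "abasc", "abc")

# Comparing against the number 0 gives the same result as comparing
# against the string "0", because 0 is coerced to character first
same_result <- identical(word > 0, word > "0")

# Every word here sorts after "0", so the test passes for all of them
all_pass <- all(word > 0)

# An empty string sorts before "0", so it would fail the test
empty_passes <- "" > 0

print(same_result)   # TRUE
print(all_pass)      # TRUE
print(empty_passes)  # FALSE
```

The last line shows that the test on a character column is not guaranteed to pass: it just happens to pass for every word in this data.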


In the OP's example dataset, no row matches the criteria because in every row one of the numeric columns is 0. If we change the first row of 'coffee' to 1 or 2, that row is picked up: its 'chardonnay' value is greater than 0, and the 'word' column always passes the test.

df$coffee[1] <- 2
df %>%
    filter_all(all_vars(. > 0))
# A tibble: 1 x 3
#  word           chardonnay coffee
#   <chr>               <dbl>  <dbl>
#1 aaliyahmaxwell          4      2

To test only the numeric columns, use filter_if (as suggested in the comments):

df %>% 
   filter_if(is.numeric, all_vars(. > 0))
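In recent dplyr releases the scoped verbs such as filter_all and filter_if are superseded. Assuming dplyr 1.0.4 or later (where if_all() is available), the same filter can be sketched as below. The two-row `df_small` is a cut-down, hypothetical version of the data with 'coffee' already changed to 2 in the first row.

```r
library(dplyr)

# Cut-down illustration data (not the full dataset from the question)
df_small <- tibble(
  word       = c("aaliyahmaxwell", "abasc"),
  chardonnay = c(4, 0),
  coffee     = c(2, 1)
)

# Keep rows where every numeric column is greater than 0;
# the character column 'word' is never compared
res <- df_small %>%
  filter(if_all(where(is.numeric), ~ .x > 0))

print(res)
```

Only the first row should remain, since it is the only one where both numeric columns are positive.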

Upvotes: 2
