Julia subsetting dataframe with multiple conditions

Question

In DataFramesMeta, why do I have to wrap every condition in a pair of parentheses? Below is an example data frame from which I want the rows where a is greater than 1 or missing.

using DataFrames, DataFramesMeta
d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing]);

Using DataFramesMeta to subset:

@chain d begin
    @subset @byrow begin
        (:a > 1) | (:a===missing) 
    end
end

If I don't use parentheses, errors pop up.

@chain d begin
    @subset @byrow begin
        :a > 1 | :a===missing 
    end
end
# ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context

Bogumił Kamiński · Accepted Answer

The reason is operator precedence (and is unrelated to DataFramesMeta.jl).

See:

julia> dump(:(2 > 1 | 3 > 4))
Expr
  head: Symbol comparison
  args: Array{Any}((5,))
    1: Int64 2
    2: Symbol >
    3: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol |
        2: Int64 1
        3: Int64 3
    4: Symbol >
    5: Int64 4

As you can see, 2 > 1 | 3 > 4 gets parsed as 2 > (1 | 3) > 4, because | binds more tightly than the comparison operators. That is not what you want.
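The parse can also be checked by evaluating the expression directly. The following is a minimal sketch in plain Julia (no DataFrames needed); the variable name ex is only illustrative.

```julia
# Parse the expression from the dump above and inspect its structure.
ex = Meta.parse("2 > 1 | 3 > 4")
@show ex.head       # :comparison, i.e. one chained comparison
@show ex.args[3]    # the middle operand of the chain, 1 | 3

# Without parentheses: 1 | 3 == 3, so this is the chain 2 > 3 > 4.
println(2 > 1 | 3 > 4)       # false

# With parentheses each comparison is evaluated first, then combined.
println((2 > 1) | (3 > 4))   # true

# || binds more loosely than >, so no parentheses are needed here.
println(2 > 1 || 3 > 4)      # true
```

The middle case is the shape used in the question's working version, and the last one is why the || variant can drop the parentheses.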

However, I would recommend the following syntax for your case:

julia> @chain d begin
           @subset @byrow begin
               coalesce(:a > 1, true)
           end
       end
2×2 DataFrame
 Row │ a        b
     │ Int64?   String?
─────┼──────────────────
   1 │       2  y
   2 │ missing  missing

or

julia> @chain d begin
           @subset @byrow begin
               ismissing(:a) || :a > 1
           end
       end
2×2 DataFrame
 Row │ a        b
     │ Int64?   String?
─────┼──────────────────
   1 │       2  y
   2 │ missing  missing

I personally prefer coalesce, but it is a matter of taste.

Note that ||, as opposed to |, does not require parentheses. However, to take advantage of the short-circuiting behavior of ||, the ismissing check has to come first, so the order of the conditions is reversed compared to your code. If the comparison came first, you would get an error:

julia> @chain d begin
           @subset @byrow begin
               :a > 1 || ismissing(:a)
           end
       end
ERROR: TypeError: non-boolean (Missing) used in boolean context
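What happens in the row where :a is missing can be reproduced on a single value. This is a small sketch in plain Julia; x stands in for that row's value of :a, and err is just an illustrative name.

```julia
x = missing   # stands in for the value of :a in the third row

@show x > 1                       # missing, not false
@show (x > 1) | (x === missing)   # true, because missing | true is true
@show ismissing(x) || x > 1       # true, x > 1 is never evaluated
@show coalesce(x > 1, true)       # true, the missing result is replaced by true

# With the comparison first, || receives missing as its left operand,
# and || only accepts Bool there.
err = try
    x > 1 || ismissing(x)
catch e
    e
end
@show err isa TypeError           # true
```

So | tolerates missing through three-valued logic, while || requires a Bool on its left-hand side, which is why the order of the conditions matters only for ||.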

Finally, with @rsubset this can be just:

julia> @chain d begin
           @rsubset coalesce(:a > 1, true)
       end
2×2 DataFrame
 Row │ a        b
     │ Int64?   String?
─────┼──────────────────
   1 │       2  y
   2 │ missing  missing

(I assume you want @chain because this is one of several steps in your analysis, so I kept it.)
