Lucas

Reputation: 1448

How to handle missing in boolean context in Julia?

I'm trying to create a categorical variable based on ranges of values from another (numerical) column. However, the code doesn't work when the numerical column contains missing values.

Here is a replicable example:

using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;

df = dataset("datasets","iris")
#lowercase columns just for convenience
@pipe df |> rename!(_, [lowercase(k) for k in names(df)]);

#without this line, the code works fine
@pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);

df[:size] = @. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = @. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = @. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = @. ifelse(df[:sepallength]>5, "huge", df[:size])

println(@pipe df |> freqtable(_, :size))

Output:

TypeError: non-boolean (Missing) used in boolean context

I would like to ignore the missing cases in the numerical variable, but I cannot just drop the missings because that would drop other important information in my dataset. Moreover, if I drop just the missings in sepallength, the column df[:size] would have a different length than the original data frame.

Upvotes: 4

Views: 1262

Answers (2)

Nils Gudat

Reputation: 13800

I think Bogumil's approach is correct and probably best for most situations, but one other option that I like to use is to define my own comparison operators that can deal with missings by returning false if a missing is encountered. Using the unicode capabilities of Julia makes this quite pleasant in my opinion:

julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;

julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;

julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;

julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;

julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;

julia> x = rand([missing; 1:10], 50)

julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...

julia> x .>ₘ 10
50-element BitArray{1}
...

There are of course downsides to defining such elementary operators in your own code, particularly with Unicode: the code becomes harder for other people to read (and potentially even to display correctly!). So I probably wouldn't advocate this as the standard approach, or as something to use in library code. I do think, though, that it makes life easier for exploratory work.
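If the Unicode suffix is the main concern, the same idea can be sketched with an ordinary function name. The name `le_m` below is hypothetical (it is not part of the answer above or of any library); it mirrors the `<=ₘ` definition and is applied to a small made-up vector rather than the iris data.

```julia
# Hypothetical ASCII-named equivalent of the <=ₘ operator defined above:
# returns false instead of missing when either argument is missing.
le_m(x, y) = ismissing(x) || ismissing(y) ? false : x <= y

# Made-up sample values standing in for a column with missings.
x = [4.6, missing, 5.1]

# Broadcasting a function that always returns Bool yields a plain Bool mask.
mask = le_m.(x, 4.7)

# The mask can now be used safely in ifelse.
labels = ifelse.(mask, "small", missing)
```

Because the mask never contains `missing`, `ifelse` no longer hits the non-boolean error, at the cost of treating "unknown" as "does not match".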

Upvotes: 3

Bogumił Kamiński

Reputation: 69879

Use the coalesce function like this:

julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
 1
 2
 3
  missing
 5
 6
 7

julia> @. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
 "small"
 "small"
 "small"
 missing
 missing
 missing
 missing
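The reason this works: a comparison with `missing` returns `missing` rather than `true` or `false`, and `ifelse` only accepts a `Bool` condition, so `coalesce(..., false)` turns the unknown cases into `false`. As a rough sketch, the same pattern can be applied to all four ranges from the question. The vector `sepallength` below is made-up sample data standing in for the data frame column, and `labels` is an illustrative name.

```julia
# Made-up sample values standing in for the sepallength column.
sepallength = [4.6, 4.8, missing, 5.0, 5.4]

# Each step keeps the previous label unless the (coalesced) condition is true.
labels = @. ifelse(coalesce(sepallength <= 4.7, false), "small", missing)
labels = @. ifelse(coalesce((sepallength > 4.7) & (sepallength <= 4.9), false), "avg", labels)
labels = @. ifelse(coalesce((sepallength > 4.9) & (sepallength <= 5), false), "large", labels)
labels = @. ifelse(coalesce(sepallength > 5, false), "huge", labels)
```

Rows where the numerical value is missing keep a `missing` label, so the result has the same length as the input.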

As a side note, do not write df[:size]: this syntax has been deprecated for over 2 years and will soon error. Use df.size or df."size" to access a column of the data frame instead (the df."size" form is for column names that contain characters such as spaces, e.g. df."my fancy column!").

Upvotes: 5
