Indexing Julia's DataArrays with included NA values

Question

I am wondering why indexing Julia's DataArrays with NA values is not possible. Excuting the snipped below results in an error(NAException("cannot index an array with a DataArray containing NA values")):

dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA

dm[dm .< 3] = 1  #Error
dm[(!isna(dm)) & (dm .< 3)] = 1  #Working

There is a solutions to ignore NA's in a DataFrame with isna(), like answered here. At a first glance it works like it should and ignoring NA's in DataFrames is the same approach like for the DataArrays, because each column of a DataFrame is a DataArray, stated here. But in my opinion ignoring missing values with !isna() on each condition is not the best solution.

For me it's not clear why the DataFrame Module throws an error if NA's are included. If the boolean Array needed for indexing, has NA's values, this values should convert to false like MATLAB® or Pythons Pandas does. In the DataArray modules sourcecode(shown below) in indexing.jl, there is an explicit function to throw the NAException:

# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end

If you change the snippet by setting the NA's to false ...

# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end

... dm[dm .< 3] = 1 works like it should(like in MATLAB® or Pandas).

For me it make no sense to automatically throw error if NA's are included on indexing. There should leastwise be a parameter creating the DataArray to let the user choose if NA's are ignored. There are two siginificant reasons: On the one hand it's not very pleasent for writing and reading code, when you have formulas with a lot of indexing and NA values (e.g calculating meteorological grid models) and on the other hand there is a noticeable loss of performance, which this timetest is showing:

@timeit dm[(!isna(dm)) & (dm .< 3)] = 1  #14.55 µs per loop  
@timeit dm[dm .< 3] = 1  #754.79 ns per loop

What is the reason that the developers make use of this exception and is there another simpler approach as the !isna() for ignoring NA's in DataArrays?

Indexing Julia's DataArrays with included NA values

Answers (1)

Related Questions

Indexing Julia&#39;s DataArrays with included NA values

Answers (1)

Related Questions

Indexing Julia's DataArrays with included NA values