Reputation: 2542
I am wondering why it is not possible to index Julia's DataArrays with a mask that contains NA values. Executing the snippet below results in an error (NAException("cannot index an array with a DataArray containing NA values")):
dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA
dm[dm .< 3] = 1 #Error
dm[(!isna(dm)) & (dm .< 3)] = 1 #Working
There is a solution for ignoring NAs in a DataFrame with isna(), as answered here. At first glance it works as it should, and ignoring NAs in a DataFrame uses the same approach as for DataArrays, because each column of a DataFrame is a DataArray, as stated here. But in my opinion, ignoring missing values with !isna() in every condition is not the best solution.
It is not clear to me why the DataFrame module throws an error if NAs are included. If the Boolean array needed for indexing contains NA values, these values should be converted to false, as MATLAB® or Python's pandas does. In the DataArray module's source code (shown below), indexing.jl contains an explicit function that throws the NAException:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
If you change the snippet by setting the NA's to false ...
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
... dm[dm .< 3] = 1 works as it should (as in MATLAB® or pandas).
To me it makes no sense to automatically throw an error if NAs are present when indexing. There should at least be a parameter when creating the DataArray that lets the user choose whether NAs are ignored. There are two significant reasons. On the one hand, the code is not very pleasant to write or read when you have formulas with a lot of indexing and NA values (e.g. when calculating meteorological grid models). On the other hand, there is a noticeable loss of performance, as this timing test shows:
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 #14.55 µs per loop
@timeit dm[dm .< 3] = 1 #754.79 ns per loop
Why did the developers choose to throw this exception, and is there a simpler approach than !isna() for ignoring NAs in DataArrays?
Upvotes: 0
Views: 287
Reputation: 556
Suppose you have three rabbits. You want to put the female rabbit(s) in a separate cage from the males. You look at the first rabbit, and it looks like a male, so you leave it where it is. You look at the second rabbit, and it looks like a female, so you move it to the separate cage. You can't really get a good look at the third rabbit. What should you do?
It depends. Maybe you're fine with leaving the rabbit of unknown sex behind. But if you're separating out the rabbits because you don't want them to make baby rabbits, then you might want your analysis software to tell you that it doesn't know the sex of the third rabbit.
Situations like this arise often when analyzing data. In the most pathological cases, data is missing systematically rather than at random. If you were to survey a bunch of people about how fluffy rabbits are and whether they should be eaten more, you could compare mean(fluffiness[should_be_eaten_more]) and mean(fluffiness[!should_be_eaten_more]). But, if people who really like rabbits are incensed that you're talking about eating them at all, they might leave that second question blank. If you ignore that, you will underestimate the mean fluffiness rating among people who don't think rabbits should be eaten more, which would be a grave mistake. This is why fluffiness[!should_be_eaten_more] will throw an error if there are missing values: It is a sign that whatever you are trying to do with your data may not give the right results. This situation is bad enough that people write entire papers about it, e.g. this one.
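To make the survey scenario concrete, here is a rough sketch with made-up numbers. It assumes a recent Julia (1.x), where the built-in missing plays the role that NA played in DataArrays. The variable names come from the paragraph above, but the data and the two "policies" are purely illustrative.

```julia
using Statistics

# Made-up survey responses, for illustration only.
fluffiness           = [9, 8, 3, 4, 10]
should_be_eaten_more = [false, false, true, true, missing]  # last respondent left it blank

# `.!` propagates missing, so the "no" mask itself contains a missing entry.
said_no = .!should_be_eaten_more

# Policy 1: keep only respondents who explicitly answered "no".
mean_answered_no = mean(fluffiness[coalesce.(said_no, false)])

# Policy 2: assume a blank answer means "no" (e.g. an offended rabbit lover).
mean_if_blank_means_no = mean(fluffiness[coalesce.(said_no, true)])

println(mean_answered_no)        # 8.5
println(mean_if_blank_means_no)  # 9.0
```

The two policies give different means, and neither is obviously the right one. That is the point: the choice has to be made by the analyst rather than silently by the indexing operation.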
Enough about rabbits. It is possible that there should be (and may someday be) a more concise way to drop/keep all missing values when indexing, but it will always be explicit rather than implicit for the reason described above. As far as performance goes, while there is a slowdown for !isna(x) & (x .< 3) vs x .< 3, the overhead of repeatedly indexing into an array is also high, and DataArrays adds additional overhead on top of that. The relative overhead decreases as the array gets larger. If this is a bottleneck in your code, your best bet is to write it differently.
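As a minimal sketch of what "explicit rather than implicit" can look like, assuming a recent Julia (1.x) where missing has replaced DataArrays' NA: the Base function coalesce lets you state in one place how an unknown comparison result should be read. The matrix is the one from the question; this is one possible spelling, not the DataArrays API.

```julia
# Assumes Julia 1.x, where `missing` plays the role of DataArrays' `NA`.
dm = Union{Int,Missing}[1 4 7; 2 5 8; 3 1 9]
dm[dm .== 5] .= missing

# `dm .< 3` contains `missing` where the value is unknown.
# Say explicitly that "unknown" should count as false for this assignment.
mask = coalesce.(dm .< 3, false)
dm[mask] .= 1

println(dm)
```

Here the missing entry is left untouched and only the entries known to be less than 3 are overwritten. Passing true instead of false to coalesce would select the unknown entries as well, so the decision stays visible in the code.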
Upvotes: 6