godines

Reputation: 359

How to handle "NA" in Julia

I need to handle a file with missing values ("NA") in Julia.
The command that I'm using to read the file is:

file = readdlm("FILE_NAs.txt", header=false)

The problem is that I can't use these files in math operations (like matrix multiplication) because of the "NA"s.
I tried to use the package "DataArray" and the function "dropna(file)", but it did not work.
So, I'd like to ignore or even remove those "NA" values.

Here is a sample of the loaded file (space delimited):

"Ind1" "NA"  "NA"  "NA"   "NA"   "NA"   "NA"  2   "NA"   "NA"
"Ind2" "NA"  "NA"  "NA"   "NA"   "NA"   "NA"  2   "NA"   "NA"
"Ind3" "NA"  "NA"  "NA"   "NA"   "NA"   "NA"  1   "NA"   "NA"
"Ind4" "NA"  "NA"  "NA"   "NA"   "NA"   "NA"  2   "NA"   "NA"
"Ind5" 0     0     0      0      0      0     1   0      0 
"Ind6" 1     0     0      0      1      1     2   1      1 
"Ind7" 1     0     0      0      1      1     2   1      1 
"Ind8" 0     0     0      0      0      0     2   0      0

Upvotes: 1

Views: 2999

Answers (1)

Kevin L. Keys

Reputation: 995

The NA type is explicitly designed to poison linear algebra operations, so you should not be multiplying arrays with NA in them.

I am assuming that you load the data with something like

using DataFrames
x = readtable("FILE_NAs.txt", header = false, separator = ' ')

If you merely want to purge the rows containing NA, then the easiest thing to do is probably to call

y = DataFrames.na_omit(x)[1]

That will yield a new DataFrame where any row containing NA has been purged. If you want to extract the numeric data from your example file, then something like

z = convert(Matrix{Int}, y[2:end])

should work. We can index y like a vector because a DataFrame behaves like a vector of columnar DataArrays. Note that the conversion of a DataFrame with NA entries to a Matrix will fail.
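
For readers on current Julia (1.x), where `missing` has replaced `NA`, the same row purge can be sketched with only the standard library. This is not the DataFrames approach described above, just a minimal illustration. The `sample` string is a hypothetical in-memory stand-in for FILE_NAs.txt, cut down to three rows and four data columns.

```julia
using DelimitedFiles  # standard library; provides readdlm

# Hypothetical stand-in for FILE_NAs.txt (a reduced version of the sample rows)
sample = """
"Ind3" "NA" "NA" 1 "NA"
"Ind5" 0 0 1 0
"Ind6" 1 0 2 1
"""

# Whitespace-delimited read; quoted cells come back as strings without the quotes
raw = readdlm(IOBuffer(sample))

# Drop the label column and turn every "NA" cell into `missing`
data = raw[:, 2:end]
vals = map(v -> v == "NA" ? missing : v, data)

# Keep only the rows that contain no missing entries
keep = [!any(ismissing, vals[i, :]) for i in 1:size(vals, 1)]

# With the missing values gone, the result converts to a plain integer matrix
clean = Int.(vals[keep, :])

# Linear algebra now works as usual
product = clean * clean'
```

Only the rows for "Ind5" and "Ind6" survive here, because "Ind3" contains "NA" entries.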

If instead you want to purge by column, then determine which columns have NA in them. One way to do this is via

# get a Bool array of NA positions
y = array(map(isna, eachcol(x)))

# get a Bool vector flagging the columns with no NA in them
z = vec(!reducedim(|, y, 1))

# now extract columns of x with no missing data
x[z] # <-- only has columns x1, x8

DataFrame gurus may know a simpler way to do this.
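
On current Julia (1.x), a column purge on a plain matrix can be sketched in a similar way, assuming the data already holds `missing` where the file had "NA". The matrix `vals` below is a hypothetical stand-in for the numeric part of the sample, not the output of the DataFrames code above.

```julia
# Hypothetical matrix standing in for the numeric part of the sample data
vals = [missing 2 missing;
        0       1 0;
        1       2 1]

# One Bool per column: true where the column contains at least one missing value
has_missing = vec(any(ismissing, vals; dims=1))

# Select the complete columns and convert them to a plain integer matrix
keepcols = .!has_missing
clean = Int.(vals[:, keepcols])
```

Here only the middle column is complete, so `clean` is a 3×1 matrix.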

Upvotes: 2
