Reputation: 359
I need to handle a file with missing values ("NA") in Julia.
The command that I'm using to read the file is:
file = readdlm("FILE_NAs.txt", header=false)
The problem is that I can't use this file in math operations (like matrix multiplication) because of the "NA"s.
I tried to use the package "DataArray" and the function "dropna(file)", but it did not work.
So, I'd like to ignore or even remove those "NA" values.
Here is a sample of the loaded file (space delimited):
"Ind1" "NA" "NA" "NA" "NA" "NA" "NA" 2 "NA" "NA"
"Ind2" "NA" "NA" "NA" "NA" "NA" "NA" 2 "NA" "NA"
"Ind3" "NA" "NA" "NA" "NA" "NA" "NA" 1 "NA" "NA"
"Ind4" "NA" "NA" "NA" "NA" "NA" "NA" 2 "NA" "NA"
"Ind5" 0 0 0 0 0 0 1 0 0
"Ind6" 1 0 0 0 1 1 2 1 1
"Ind7" 1 0 0 0 1 1 2 1 1
"Ind8" 0 0 0 0 0 0 2 0 0
Upvotes: 1
Views: 2999
Reputation: 995
The NA type is explicitly designed to poison linear algebra operations, so you should not be multiplying arrays with NA in them.
I am assuming that you load the data with something like
using DataFrames
x = readtable("FILE_NAs.txt", header = false, separator = ' ')
If you merely want to purge the rows containing NA, then the easiest thing to do is probably to call
y = DataFrames.na_omit(x)[1]
That will yield a new DataFrame where any row containing NA has been purged. If you want to extract the numeric data from your example file, then something like
z = convert(Matrix{Int}, y[2:end])
should work. We can index y like a vector because a DataFrame behaves like a vector of columnar DataArrays. Note that the conversion of a DataFrame with NA entries to a Matrix will fail.
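For readers without the DataFrames functions used above, the same row purge can be sketched with only the DelimitedFiles standard library. This is a minimal sketch, assuming Julia 1.1 or later and a file whose missing entries load as the string "NA"; the inline `sample` text is a shortened, hypothetical stand-in for FILE_NAs.txt, and the names `raw`, `data`, `keep`, and `m` are illustrative only.

```julia
using DelimitedFiles

# Hypothetical, shortened stand-in for FILE_NAs.txt (assumption: whitespace delimited).
sample = """
Ind4 NA NA 2
Ind5 0 0 1
Ind6 1 0 2
"""

# Mixed text and numbers load as a Matrix{Any}.
raw = readdlm(IOBuffer(sample))

# Drop the label column, keeping only the measurement columns.
data = raw[:, 2:end]

# Flag the rows that contain no "NA" entry.
keep = [!any(x -> x == "NA", row) for row in eachrow(data)]

# Convert the surviving rows to an integer matrix usable in linear algebra.
m = Matrix{Int}(data[keep, :])

println(m * m')
```

To read the real file, replace `IOBuffer(sample)` with the file name. The conversion to `Matrix{Int}` would throw if any "NA" survived the filter, which acts as a cheap sanity check.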
If instead you want to purge by column, then determine which columns have NA in them. One way to do this is via
# get a Bool array of NA positions
y = array(map(isna, eachcol(x)))
# get a vector indexing columns with NA in them
z = vec(!reducedim(|, y, 1))
# now extract columns of x with no missing data
x[z] # <-- only has columns x1, x8
DataFrame gurus may know a simpler way to do this.
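The column-wise purge can also be sketched without DataFrames. Again this is only a sketch under assumptions: Julia 1.1 or later, missing entries that load as the string "NA", and a shortened, hypothetical `sample` standing in for FILE_NAs.txt. The names `keepcols` and `complete` are illustrative.

```julia
using DelimitedFiles

# Hypothetical, shortened stand-in for FILE_NAs.txt (assumption: whitespace delimited).
sample = """
Ind4 NA NA 2
Ind5 0 0 1
Ind6 1 0 2
"""

raw = readdlm(IOBuffer(sample))

# Drop the label column, keeping only the measurement columns.
data = raw[:, 2:end]

# Flag the columns that contain no "NA" entry.
keepcols = [!any(x -> x == "NA", col) for col in eachcol(data)]

# Keep only the fully observed columns and convert them to integers.
complete = Matrix{Int}(data[:, keepcols])

println(complete)
```

In this shortened sample only the last measurement column is fully observed, so `complete` has a single column.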
Upvotes: 2