Amit Kumar Tiwari

Reputation: 199

Julia | DataFrame | Replacing missing Values

How can we replace missing values with 0.0 for a column in a DataFrame?

Upvotes: 13

Views: 10998

Answers (4)

Cameron Bieganek

Reputation: 7704

There are a few different approaches to this problem (valid for Julia 1.x):

Base.replace!

Probably the easiest approach is to use replace! or replace from base Julia. Here is an example with replace!:

julia> using DataFrames

julia> df = DataFrame(x = [1, missing, 3])
3×1 DataFrame
│ Row │ x       │
│     │ Int64⍰  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 3       │

julia> replace!(df.x, missing => 0);

julia> df
3×1 DataFrame
│ Row │ x      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 0      │
│ 3   │ 3      │

However, note that at this point the type of column x still allows missing values:

julia> typeof(df.x)
Array{Union{Missing, Int64},1}

This is also indicated by the question mark following Int64 in column x when the data frame is printed out. You can change this by using disallowmissing! (from the DataFrames.jl package):

julia> disallowmissing!(df, :x)
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
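
The same behavior can be checked on a plain vector, since a column accessed as df.x is an ordinary Julia vector. The following is a minimal sketch using only Base (the data is made up for illustration); it shows that replace! keeps the element type and that an explicit conversion narrows it:

```julia
# A column with a missing entry, written as a plain vector (hypothetical data).
x = [1, missing, 3]

replace!(x, missing => 0)     # mutates x in place
println(eltype(x))            # the element type still includes Missing

# Narrow the element type explicitly; this throws if a missing value remains.
y = convert(Vector{Int}, x)
println(eltype(y))
```

Note that convert makes a copy here, whereas disallowmissing! updates the column of the data frame itself.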

Alternatively, if you use replace (without the exclamation mark) as follows, then the output will already disallow missing values:

julia> df = DataFrame(x = [1, missing, 3]);

julia> df.x = replace(df.x, missing => 0);

julia> df
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
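
Since the question asks for 0.0, here is a small sketch of the same approach on a floating-point column. The data frame and its values are hypothetical:

```julia
using DataFrames

# Hypothetical example data: a Float64 column with one missing entry.
df = DataFrame(x = [1.5, missing, 3.0])

# Non-mutating replace returns a new vector whose element type
# no longer includes Missing, so the column becomes plain Float64.
df.x = replace(df.x, missing => 0.0)

println(eltype(df.x))
```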

Finally, you can replace missing in all columns at once by using mapcols:

julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing]);

julia> mapcols(col -> replace(col, missing => 0), df)
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     0      5
   3 │     3      0
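
If the data frame mixes numeric and non-numeric columns, replacing missing with 0 everywhere may not be what you want. One possible sketch, assuming you only want to fill numeric columns, loops over the columns and skips the rest (the column names and data are made up, and nonmissingtype requires a reasonably recent Julia 1.x):

```julia
using DataFrames

# Hypothetical mixed data frame: two numeric columns and one string column.
df = DataFrame(a = [1, missing, 3],
               b = [1.5, 2.5, missing],
               c = ["x", missing, "z"])

for name in names(df)
    col = df[!, name]
    T = nonmissingtype(eltype(col))
    # Only touch columns that allow missing and hold numbers;
    # T === Union{} means the column contains nothing but missing.
    if Missing <: eltype(col) && T !== Union{} && T <: Number
        df[!, name] = coalesce.(col, zero(T))
    end
end
```

Here the string column c is left alone and still contains its missing entry.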

Base.ismissing with logical indexing

You can use ismissing with logical indexing to assign a new value to all missing entries of an array:

julia> df = DataFrame(x = [1, missing, 3]);

julia> df.x[ismissing.(df.x)] .= 0;

julia> df
3×1 DataFrame
│ Row │ x      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 0      │
│ 3   │ 3      │
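
To see what the mask does, here is a Base-only sketch on a plain vector with made-up data. As with replace!, the assignment happens in place, so the element type still allows missing afterwards:

```julia
# Hypothetical data: a vector with two missing entries.
x = [1.0, missing, 3.0, missing]

mask = ismissing.(x)      # Boolean mask, true where an entry is missing
n = count(mask)           # number of entries that will be replaced

x[mask] .= 0.0            # assign 0.0 to the masked positions in place

println(n)
println(eltype(x))
```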

Base.coalesce

Another approach is to use coalesce:

julia> df = DataFrame(x = [1, missing, 3]);

julia> df.x = coalesce.(df.x, 0);

julia> df
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
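
coalesce returns its first non-missing argument, so it also accepts more than one fallback. A short sketch on plain vectors (the names a and b and their values are hypothetical) that takes a value from a, then from b, and uses 0 only if both are missing:

```julia
# Hypothetical data: two vectors with missing entries in different places.
a = [1, missing, missing]
b = [missing, 2, missing]

# For each position: first non-missing of a[i], b[i], 0.
filled = coalesce.(a, b, 0)

println(filled)
```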

DataFramesMeta

Both replace and coalesce can be used with the @transform macro from the DataFramesMeta.jl package:

julia> using DataFramesMeta

julia> df = DataFrame(x = [1, missing, 3]);

julia> @transform(df, x = replace(:x, missing => 0))
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │

julia> df = DataFrame(x = [1, missing, 3]);

julia> @transform(df, x = coalesce.(:x, 0))
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │


Upvotes: 18

srgk26

Reputation: 21

Here is a shorter and more up-to-date answer, since Julia recently introduced the missing value.

using DataFrames
df = DataFrame(A=rand(1:50, 5), B=rand(1:50, 5), C=vcat(rand(1:50,3), missing, rand(1:50))) ## Create 5 random integers in the range 1:50 per column, with one missing value in column C
df = DataFrame(replace!(Matrix(df), missing=>0), names(df)) ## Convert to a matrix first, since replace! is not defined for a DataFrame itself; passing names(df) keeps the original column names

Upvotes: 1

Dan Getz

Reputation: 18227

The other answers are all pretty good. If you are a real speed junkie, the following might be for you (note that it uses the older DataArrays-based API, in which missing values were written as NA):

# prepare example
using DataFrames
df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
df[ df[:A] %2 .== 0, :B ] = NA


df[:B].data[df[:B].na] = 0.0 # put the 0.0 into NAs
df[:B] = df[:B].data         # with no NAs might as well use array
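
The NA-based code above does not run on current versions of DataFrames.jl. A rough equivalent with missing might look like the following sketch; it is an adaptation under that assumption, not the original author's code, and it makes no claim about relative speed:

```julia
using DataFrames

# Prepare the example: B is missing wherever A is even.
df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
allowmissing!(df, :B)               # let column B hold missing values
df.B[df.A .% 2 .== 0] .= missing

# Replace the missing entries with 0.0; the result is a plain Float64 column.
df.B = coalesce.(df.B, 0.0)
```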

Upvotes: 1

Felipe Lema

Reputation: 2718

Create a df with some NAs (this uses the older NA-based API):

using DataFrames
df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
df[ df[:B] %2 .== 0, :A ] = NA

You'll see some NA values in df. Now convert them to 0.0:

df[ isna(df[:A]), :A] = 0

EDIT: NaN → NA. Thanks @Reza

Upvotes: 2
