xiaodai

Reputation: 16074

What's an efficient way to fill `missing` values with previous non-missing value?

I have a vector

using Missings
v = allowmissing(rand(100))
v[rand(100) .< 0.1] .= missing

What's the best way to fill v with the last non-missing value?

Currently I do this:

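for (i, val) in enumerate(v)
    # copy the previous (already filled) value forward
    ismissing(val) && i >= 2 && (v[i] = v[i-1])
end
# back-fill any leading missings with the first non-missing value
first_non_missing = findfirst(!ismissing, v)
if first_non_missing !== nothing && first_non_missing >= 2
    v[1:first_non_missing-1] .= v[first_non_missing]
end
v = disallowmissing(v)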

But I found it to be slow for large vectors. What's an elegant and efficient way to fill missing values with previous non-missing values?

Upvotes: 1

Views: 625

Answers (3)

Hongtao Hao

Reputation: 83

The following answer is entirely based on the discussions in this thread: Julia DataFrame Fill NA with LOCF. More specifically, it is based on the answers by Danish Shrestha, Dan Getz, and btsays.

As laborg implies, the accumulate function in Base Julia will do the job.

Suppose we have an array: a = [1, missing, 2, missing, 9]. We want to replace the first missing with 1 and the second with 2, giving [1, 1, 2, 2, 9]. That is the same as a[[1, 1, 3, 3, 5]], where [1, 1, 3, 3, 5] are indices into a.

This function will do the job:

ffill(v) = v[accumulate(max, [i*!ismissing(v[i]) for i in 1:length(v)], init=1)]

BTW, "ffill" means "forward filling", a name I adopted from Pandas.

Here is how it works.

The accumulate function applies a binary operation cumulatively along an array and returns the running results as a new array. With max as the operation, each output element is the largest value seen so far.

For those of you who are new to Julia like me: in Julia's arithmetic, i*true = i and i*false = 0. Therefore, when v[i] is NOT missing, i*!ismissing(v[i]) = i; otherwise, i*!ismissing(v[i]) = 0.

In the case of a = [1, missing, 2, missing, 9], [i*!ismissing(a[i]) for i in 1:length(a)] will return [1, 0, 3, 0, 5]. Since this array is in the accumulate function where the operation is max, we'll get [1, 1, 3, 3, 5].

Then a[[1, 1, 3, 3, 5]] will return [1, 1, 2, 2, 9].

That's why

a = ffill(a)

will get [1, 1, 2, 2, 9].
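The steps above can be traced one at a time. This is only a sketch of the same computation split into named intermediates; the variable names marked, carried, and filled are mine and are not part of the original answer.

```julia
a = [1, missing, 2, missing, 9]

# Keep the index where a value is present, 0 where it is missing.
marked = [i * !ismissing(a[i]) for i in 1:length(a)]   # [1, 0, 3, 0, 5]

# A running maximum carries the last "present" index forward.
carried = accumulate(max, marked, init=1)              # [1, 1, 3, 3, 5]

# Indexing with the carried indices performs the forward fill.
filled = a[carried]                                    # [1, 1, 2, 2, 9]
```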

Now, you may wonder why we have init = 1 in ffill(v). Say b = [missing, 1, missing, 3]. Then [i*!ismissing(b[i]) for i in 1:length(b)] returns [0, 2, 0, 4], and accumulate without init returns [0, 2, 2, 4]. The next step, b[[0, 2, 2, 4]], throws a BoundsError because Julia arrays are indexed from 1, not 0, so b[0] does not exist.

With init = 1 in the accumulate function, we'll get [1, 2, 2, 4] rather than [0, 2, 2, 4] since 1 (the init we set) is larger than 0 (the first number).
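The effect of init can be checked directly. In this sketch (variable names are mine), note one consequence that is easy to miss: a leading missing is not filled, because index 1 still points at a missing element.

```julia
b = [missing, 1, missing, 3]

marked = [i * !ismissing(b[i]) for i in 1:length(b)]   # [0, 2, 0, 4]

without_init = accumulate(max, marked)                 # [0, 2, 2, 4], contains an invalid index 0
with_init    = accumulate(max, marked, init=1)         # [1, 2, 2, 4], all valid indices

# b[1] is itself missing, so the first element stays missing.
filled = b[with_init]                                  # [missing, 1, 1, 3]
```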

We can go further from here. The ffill() function above works on a single array. But what if we have a large DataFrame?

Say, we have:

using DataFrames

a = ["Tom", "Mike", "John", "Jason", "Bob"]
b = [missing, 2, 3, missing, 8]
c = [1, 3, missing, 99, missing]
df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
julia> df

5×3 DataFrame
 Row │ Name    Var1     Var2    
     │ String  Int64?   Int64?  
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike          2        3
   3 │ John          3  missing 
   4 │ Jason   missing       99
   5 │ Bob           8  missing 

Here, Dan Getz's answer comes in handy:

nona_df = DataFrame([ffill(df[!, c]) for c in names(df)], names(df))
julia> nona_df 

5×3 DataFrame
 Row │ Name    Var1     Var2   
     │ String  Int64?   Int64? 
─────┼─────────────────────────
   1 │ Tom     missing       1
   2 │ Mike          2       3
   3 │ John          3       3
   4 │ Jason         3      99
   5 │ Bob           8      99

Upvotes: 1

laborg

Reputation: 871

A simple and fast solution:

replace_missing!(v) = accumulate!((n0,n1) -> ismissing(n1) ? n0 : n1, v, v, init=zero(eltype(v)))
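A small usage sketch of the one-liner, on a made-up input vector. It relies on zero(eltype(v)) resolving to the zero of the non-missing type, which as far as I know Base defines for Union{T, Missing}. As a result a leading missing is replaced by 0 rather than by a later value, which may or may not be what you want.

```julia
replace_missing!(v) = accumulate!((n0, n1) -> ismissing(n1) ? n0 : n1, v, v, init=zero(eltype(v)))

v = [missing, 1.5, missing, 3.0]   # Vector{Union{Missing, Float64}}
replace_missing!(v)                # fills in place; v is now [0.0, 1.5, 1.5, 3.0]
```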

Upvotes: 2

longemen3000

Reputation: 1313

You need an init value in case the first value is missing, and I can't execute your code. That said, here is my attempt:

function replace_missing!(v,init=zero(eltype(v)))
    function reduce_missing(n0,n1)
        if ismissing(n1)
            return n0
        else
            return n1
        end
    end
    v[1] = reduce_missing(init,v[1])
    for i = 2:length(v)
        v[i] = reduce_missing(v[i-1],v[i])
    end
    return v
end
using Missings
v = allowmissing(rand(100))
v[rand(100) .< 0.1] .= missing
v = replace_missing!(v)
v = disallowmissing(v)
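If leading missing values should take the first non-missing value, as the code in the question does, the same loop can be adapted. This is a minimal sketch using only Base; the function name locf! and the test vector are hypothetical, and the final conversion to Vector{Float64} stands in for disallowmissing.

```julia
function locf!(v)
    # Seed with the first non-missing value so leading missings get back-filled.
    first_idx = findfirst(!ismissing, v)
    first_idx === nothing && return v   # everything is missing: nothing to fill with
    last_seen = v[first_idx]
    for i in eachindex(v)
        if ismissing(v[i])
            v[i] = last_seen            # carry the last observed value forward
        else
            last_seen = v[i]
        end
    end
    return v
end

v = [missing, missing, 2.0, missing, 5.0, missing]
locf!(v)                    # v is now [2.0, 2.0, 2.0, 2.0, 5.0, 5.0]
w = Vector{Float64}(v)      # narrow the element type once no missings remain
```

It makes a single pass after the initial findfirst and allocates nothing beyond the final conversion. Wrapping the work in a function also avoids operating on a global variable inside a top-level loop.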

Upvotes: 1
