I have a vector:
using Missings
v = allowmissing(rand(100))
v[rand(100) .< 0.1] .= missing
What's the best way to fill v with the last non-missing value? Currently I use:
for (i, val) in enumerate(v)
    ismissing(val) && (i >= 2) && (v[i] = v[i-1])
end
first_non_missing = findfirst(x -> !ismissing(x), v)
if first_non_missing >= 2
    v[1:first_non_missing-1] .= v[first_non_missing]
end
v = disallowmissing(v)
But I found it to be slow for large vectors. What's an elegant and efficient way to fill missing values with previous non-missing values?
The following answer is entirely based on the discussions in this thread: Julia DataFrame Fill NA with LOCF. More specifically, it is based on the answers by Danish Shrestha, Dan Getz, and btsays.
As laborg implies, the accumulate function in Base Julia will do the job.
Suppose we have an array: a = [1, missing, 2, missing, 9]. We want to replace the first missing with 1 and the second with 2, giving a = [1, 1, 2, 2, 9]. This is the same as a[[1, 1, 3, 3, 5]], where [1, 1, 3, 3, 5] are indices into a.
This function will do the job:
ffill(v) = v[accumulate(max, [i*!ismissing(v[i]) for i in 1:length(v)], init=1)]
BTW, "ffill" means "forward filling", a name I adopted from Pandas.
I'll explain in the following.
The accumulate function returns a new array in which each element is the result of applying the operation to the previous result and the current input element (a running reduction).
For those of you who are new to Julia like me: in Julia's arithmetic, i*true = i and i*false = 0. Therefore, when the element at index i is NOT missing, i*!ismissing(v[i]) = i; otherwise, i*!ismissing(v[i]) = 0.
In the case of a = [1, missing, 2, missing, 9], the comprehension [i*!ismissing(a[i]) for i in 1:length(a)] returns [1, 0, 3, 0, 5]. Since this array is passed to accumulate with max as the operation, we get the running maximum [1, 1, 3, 3, 5].
Then a[[1, 1, 3, 3, 5]] returns [1, 1, 2, 2, 9].
That's why a = ffill(a) gives [1, 1, 2, 2, 9].
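The three steps described above can be traced one at a time. This is only an illustrative sketch; the intermediate names idx, last_idx and filled are not part of the answer's ffill function and are introduced here for clarity.

```julia
a = [1, missing, 2, missing, 9]

# Step 1: keep the index where a value is present, 0 where it is missing.
idx = [i * !ismissing(a[i]) for i in 1:length(a)]   # [1, 0, 3, 0, 5]

# Step 2: a running maximum carries the last non-zero index forward.
last_idx = accumulate(max, idx, init=1)             # [1, 1, 3, 3, 5]

# Step 3: index the original array with the carried-forward positions.
filled = a[last_idx]                                # [1, 1, 2, 2, 9]
```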
Now, you may wonder why we have init = 1 in ffill(v). Say, b = [missing, 1, missing, 3]. Then [i*!ismissing(b[i]) for i in 1:length(b)] returns [0, 2, 0, 4], and without init the accumulate function returns [0, 2, 2, 4]. The next step, b[[0, 2, 2, 4]], throws an error because Julia arrays are indexed from 1, not 0, so b[0] is out of bounds.
With init = 1, the accumulate function returns [1, 2, 2, 4] rather than [0, 2, 2, 4], since 1 (the init we set) is larger than 0 (the first number). Note that b[1] is itself missing, so a leading missing stays missing: there is no earlier value to carry forward.
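A short sketch of the effect of init on the example b. The names idx, without_init and with_init are illustrative only and do not appear in the answer.

```julia
b = [missing, 1, missing, 3]
idx = [i * !ismissing(b[i]) for i in 1:length(b)]   # [0, 2, 0, 4]

# Without init the first index is 0, which is not a valid position in b.
without_init = accumulate(max, idx)                 # [0, 2, 2, 4]

# With init=1 the running maximum never drops below 1.
with_init = accumulate(max, idx, init=1)            # [1, 2, 2, 4]

# The leading element has nothing before it, so it remains missing.
filled = b[with_init]                               # [missing, 1, 1, 3]
```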
We can go further from here. The ffill() function above only works for a single array. But what if we have a large dataframe?
Say, we have:
using DataFrames
a = ["Tom", "Mike", "John", "Jason", "Bob"]
b = [missing, 2, 3, missing, 8]
c = [1, 3, missing, 99, missing]
df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
julia> df
5×3 DataFrame
 Row │ Name    Var1     Var2
     │ String  Int64?   Int64?
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike          2        3
   3 │ John          3  missing
   4 │ Jason   missing       99
   5 │ Bob           8  missing
Here, Dan Getz's answer comes in handy:
nona_df = DataFrame([ffill(df[!, c]) for c in names(df)], names(df))
julia> nona_df
5×3 DataFrame
 Row │ Name    Var1     Var2
     │ String  Int64?   Int64?
─────┼─────────────────────────
   1 │ Tom     missing       1
   2 │ Mike          2       3
   3 │ John          3       3
   4 │ Jason         3      99
   5 │ Bob           8      99
A simple and fast solution:
replace_missing!(v) = accumulate!((n0,n1) -> ismissing(n1) ? n0 : n1, v, v, init=zero(eltype(v)))
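A small usage sketch of the one-liner on a hand-written vector (the sample values are made up for illustration). One thing to be aware of: because init is zero(eltype(v)), a leading missing is replaced by 0.0 rather than by the first non-missing value, which differs from what the code in the question does.

```julia
# The one-liner from above, repeated so the example is self-contained.
replace_missing!(v) = accumulate!((n0, n1) -> ismissing(n1) ? n0 : n1, v, v, init=zero(eltype(v)))

v = [missing, 1.5, missing, missing, 4.0]   # eltype is Union{Missing, Float64}
replace_missing!(v)                          # fills v in place

# v is now [0.0, 1.5, 1.5, 1.5, 4.0]; the leading missing became the init value.
```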
You need an init value in case the first value is missing, and I can't execute your code. With that said, here is my attempt:
function replace_missing!(v, init=zero(eltype(v)))
    # Keep the previous value when the current one is missing.
    function reduce_missing(n0, n1)
        if ismissing(n1)
            return n0
        else
            return n1
        end
    end
    v[1] = reduce_missing(init, v[1])
    for i = 2:length(v)
        v[i] = reduce_missing(v[i-1], v[i])
    end
    return v
end
using Missings
v = allowmissing(rand(100))
v[rand(100) .< 0.1] .= missing
v = replace_missing!(v)
v = disallowmissing(v)
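If leading missings should take the first non-missing value, as the code in the question does, instead of an init value, a loop along the following lines might work. This is a sketch under that assumption; locf_with_backfill! is a hypothetical name, not a function from Base or Missings.

```julia
# Hypothetical helper: back-fill the leading run of missings with the first
# non-missing value, then forward-fill the rest. Modifies v in place.
function locf_with_backfill!(v::AbstractVector)
    first_ok = findfirst(!ismissing, v)
    first_ok === nothing && return v      # every element is missing: nothing to fill with
    # Back-fill everything before the first non-missing value.
    for i in firstindex(v):first_ok-1
        v[i] = v[first_ok]
    end
    # Forward-fill: v[i-1] is already non-missing when we reach index i.
    for i in first_ok+1:lastindex(v)
        if ismissing(v[i])
            v[i] = v[i-1]
        end
    end
    return v
end

w = [missing, missing, 2.0, missing, 5.0, missing]
locf_with_backfill!(w)   # w is now [2.0, 2.0, 2.0, 2.0, 5.0, 5.0]
```

Unless every element was missing, the result contains no missings, so disallowmissing can be applied to it just as in the snippet above.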