Reputation: 609
I have a set of weather data from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/.
The dataset includes temperatures and rainfall etc. and uses -9999 as a dummy value to represent missing data.
I would like to replace that value with missing
in a DataFrame so that it will not be included in statistical calculations or plots. Is there a way I can do this as I create the dataframe? Or can it be done after the dataframe is created?
Upvotes: 5
Views: 1863
Reputation: 69949
Additionally to what Dan Getz proposes there are two options:
recode
function(the recode
function is defined in the CategoricalArrays.jl package so you need to load it first)
I am using Dan's example:
julia> df = DataFrame(x=rand(10),y=[rand()<0.3 ? 9999.0 : rand() for i=1:10])
10×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───────────┼──────────┤
│ 1 │ 0.856388 │ 0.322763 │
│ 2 │ 0.360254 │ 9999.0 │
│ 3 │ 0.229875 │ 0.906697 │
│ 4 │ 0.275965 │ 0.485042 │
│ 5 │ 0.126336 │ 0.205509 │
│ 6 │ 0.879974 │ 0.752962 │
│ 7 │ 0.0518579 │ 9999.0 │
│ 8 │ 0.512231 │ 0.759513 │
│ 9 │ 0.309586 │ 9999.0 │
│ 10 │ 0.616471 │ 0.978771 │
julia> df[:y] = recode(df[:y], 9999.0=>missing)
10-element Array{Union{Float64, Missings.Missing},1}:
0.322763
missing
0.906697
0.485042
0.205509
0.752962
missing
0.759513
missing
0.978771
julia> df
10×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───────────┼──────────┤
│ 1 │ 0.856388 │ 0.322763 │
│ 2 │ 0.360254 │ missing │
│ 3 │ 0.229875 │ 0.906697 │
│ 4 │ 0.275965 │ 0.485042 │
│ 5 │ 0.126336 │ 0.205509 │
│ 6 │ 0.879974 │ 0.752962 │
│ 7 │ 0.0518579 │ missing │
│ 8 │ 0.512231 │ 0.759513 │
│ 9 │ 0.309586 │ missing │
│ 10 │ 0.616471 │ 0.978771 │
Additionally if you want to recode the whole DataFrame
(all columns) into a new data frame you can use colwise
:
julia> DataFrame(colwise(x -> recode(x, 9999.0=>missing), df), names(df))
10×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───────────┼──────────┤
│ 1 │ 0.856388 │ 0.322763 │
│ 2 │ 0.360254 │ missing │
│ 3 │ 0.229875 │ 0.906697 │
│ 4 │ 0.275965 │ 0.485042 │
│ 5 │ 0.126336 │ 0.205509 │
│ 6 │ 0.879974 │ 0.752962 │
│ 7 │ 0.0518579 │ missing │
│ 8 │ 0.512231 │ 0.759513 │
│ 9 │ 0.309586 │ missing │
│ 10 │ 0.616471 │ 0.978771 │
missing
s when creating DataFrame
Here it depends on the package you use to load the data. For instance if you use CSV.jl you can add null="-9999"
keyword argument to CSV.read
. In more complex cases you can use transforms
keyword argument and e.g. use an adjusted version of val2missing
proposed by Dan there.
Upvotes: 6
Reputation: 18227
The following example defines a function (named val2missing
) to replace a specified value with missing
and then applies this function using the .
broadcast notation. The example first generates a random dataset (to simulate the weather data) with 9999.0 (this data is in df
). It then creates df2
with missing
values from df
:
julia> using DataFrames, Missings
julia> df = DataFrame(x=rand(10),y=[rand()<0.3 ? 9999.0 : rand() for i=1:10])
10×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼──────────┼───────────┤
│ 1 │ 0.926893 │ 0.9355 │
│ 2 │ 0.961057 │ 9999.0 │
│ 3 │ 0.308849 │ 9999.0 │
│ 4 │ 0.936876 │ 0.303936 │
│ 5 │ 0.541762 │ 0.5957 │
│ 6 │ 0.953278 │ 9999.0 │
│ 7 │ 0.987931 │ 9999.0 │
│ 8 │ 0.904365 │ 0.0227427 │
│ 9 │ 0.640827 │ 0.960697 │
│ 10 │ 0.66238 │ 0.769464 │
julia> val2missing(v,mv) = mv == v ? missing : v
val2missing (generic function with 1 method)
julia> df2 = DataFrame(x=df[:x], y=val2missing.(df[:y],9999.0))
10×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼──────────┼───────────┤
│ 1 │ 0.926893 │ 0.9355 │
│ 2 │ 0.961057 │ missing │
│ 3 │ 0.308849 │ missing │
│ 4 │ 0.936876 │ 0.303936 │
│ 5 │ 0.541762 │ 0.5957 │
│ 6 │ 0.953278 │ missing │
│ 7 │ 0.987931 │ missing │
│ 8 │ 0.904365 │ 0.0227427 │
│ 9 │ 0.640827 │ 0.960697 │
│ 10 │ 0.66238 │ 0.769464 │
Upvotes: 0