Barry Andersen
Barry Andersen

Reputation: 609

How to replace dummy values with missing in Julia Dataframes?

I have a set of weather data from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/.

The dataset includes temperatures and rainfall etc. and uses -9999 as a dummy value to represent missing data.

I would like to replace that value with missing in a DataFrame so that it will not be included in statistical calculations or plots. Is there a way I can do this as I create the dataframe? Or can it be done after the dataframe is created?

Upvotes: 5

Views: 1863

Answers (2)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69949

Additionally to what Dan Getz proposes there are two options:

use recode function

(the recode function is defined in the CategoricalArrays.jl package so you need to load it first)

I am using Dan's example:

julia> df = DataFrame(x=rand(10),y=[rand()<0.3 ? 9999.0 : rand() for i=1:10])
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ 9999.0   │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ 9999.0   │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ 9999.0   │
│ 10  │ 0.616471  │ 0.978771 │

julia> df[:y] = recode(df[:y], 9999.0=>missing)
10-element Array{Union{Float64, Missings.Missing},1}:
 0.322763
  missing
 0.906697
 0.485042
 0.205509
 0.752962
  missing
 0.759513
  missing
 0.978771

julia> df
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ missing  │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ missing  │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ missing  │
│ 10  │ 0.616471  │ 0.978771 │

Additionally if you want to recode the whole DataFrame (all columns) into a new data frame you can use colwise:

julia> DataFrame(colwise(x -> recode(x, 9999.0=>missing), df), names(df))
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ missing  │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ missing  │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ missing  │
│ 10  │ 0.616471  │ 0.978771 │

detect missings when creating DataFrame

Here it depends on the package you use to load the data. For instance if you use CSV.jl you can add null="-9999" keyword argument to CSV.read. In more complex cases you can use transforms keyword argument and e.g. use an adjusted version of val2missing proposed by Dan there.

Upvotes: 6

Dan Getz
Dan Getz

Reputation: 18227

The following example defines a function (named val2missing) to replace a specified value with missing and then applies this function using the . broadcast notation. The example first generates a random dataset (to simulate the weather data) with 9999.0 (this data is in df). It then creates df2 with missing values from df:

julia> using DataFrames, Missings

julia> df = DataFrame(x=rand(10),y=[rand()<0.3 ? 9999.0 : rand() for i=1:10])
10×2 DataFrames.DataFrame
│ Row │ x        │ y         │
├─────┼──────────┼───────────┤
│ 1   │ 0.926893 │ 0.9355    │
│ 2   │ 0.961057 │ 9999.0    │
│ 3   │ 0.308849 │ 9999.0    │
│ 4   │ 0.936876 │ 0.303936  │
│ 5   │ 0.541762 │ 0.5957    │
│ 6   │ 0.953278 │ 9999.0    │
│ 7   │ 0.987931 │ 9999.0    │
│ 8   │ 0.904365 │ 0.0227427 │
│ 9   │ 0.640827 │ 0.960697  │
│ 10  │ 0.66238  │ 0.769464  │

julia> val2missing(v,mv) = mv == v ? missing : v
val2missing (generic function with 1 method)

julia> df2 = DataFrame(x=df[:x], y=val2missing.(df[:y],9999.0))
10×2 DataFrames.DataFrame
│ Row │ x        │ y         │
├─────┼──────────┼───────────┤
│ 1   │ 0.926893 │ 0.9355    │
│ 2   │ 0.961057 │ missing   │
│ 3   │ 0.308849 │ missing   │
│ 4   │ 0.936876 │ 0.303936  │
│ 5   │ 0.541762 │ 0.5957    │
│ 6   │ 0.953278 │ missing   │
│ 7   │ 0.987931 │ missing   │
│ 8   │ 0.904365 │ 0.0227427 │
│ 9   │ 0.640827 │ 0.960697  │
│ 10  │ 0.66238  │ 0.769464  │

Upvotes: 0

Related Questions