How to replace dummy values with missing in Julia Dataframes?

Question

I have a set of weather data from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/.

The dataset includes temperatures and rainfall etc. and uses -9999 as a dummy value to represent missing data.

I would like to replace that value with missing in a DataFrame so that it will not be included in statistical calculations or plots. Is there a way I can do this as I create the dataframe? Or can it be done after the dataframe is created?

Bogumił Kamiński · Accepted Answer

In addition to what Dan Getz proposes, there are two options:

Use the `recode` function

(The `recode` function is defined in the CategoricalArrays.jl package, so you need to load it first.)

I am using Dan's example:

julia> df = DataFrame(x=rand(10),y=[rand()<0.3 ? 9999.0 : rand() for i=1:10])
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ 9999.0   │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ 9999.0   │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ 9999.0   │
│ 10  │ 0.616471  │ 0.978771 │

julia> df[:y] = recode(df[:y], 9999.0=>missing)
10-element Array{Union{Float64, Missings.Missing},1}:
 0.322763
  missing
 0.906697
 0.485042
 0.205509
 0.752962
  missing
 0.759513
  missing
 0.978771

julia> df
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ missing  │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ missing  │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ missing  │
│ 10  │ 0.616471  │ 0.978771 │

Additionally, if you want to recode every column of the DataFrame into a new data frame, you can use `colwise`:

julia> DataFrame(colwise(x -> recode(x, 9999.0=>missing), df), names(df))
10×2 DataFrames.DataFrame
│ Row │ x         │ y        │
├─────┼───────────┼──────────┤
│ 1   │ 0.856388  │ 0.322763 │
│ 2   │ 0.360254  │ missing  │
│ 3   │ 0.229875  │ 0.906697 │
│ 4   │ 0.275965  │ 0.485042 │
│ 5   │ 0.126336  │ 0.205509 │
│ 6   │ 0.879974  │ 0.752962 │
│ 7   │ 0.0518579 │ missing  │
│ 8   │ 0.512231  │ 0.759513 │
│ 9   │ 0.309586  │ missing  │
│ 10  │ 0.616471  │ 0.978771 │
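On Julia 1.0 and later, Base's `replace` can do the same substitution without loading CategoricalArrays.jl. Here is a minimal sketch on a plain vector rather than a DataFrame column; the sample values and variable names are made up for illustration. With a DataFrame the same call would be applied to one column, for example `df.y = replace(df.y, 9999.0 => missing)`, assuming a recent DataFrames.jl.

```julia
using Statistics  # standard library, provides mean

# Hypothetical readings where 9999.0 marks a missing observation
y = [0.5, 9999.0, 1.5, 9999.0, 1.0]

# replace returns a new vector whose element type also allows missing
y_clean = replace(y, 9999.0 => missing)

# skipmissing drops the missing entries before the statistic is computed
y_mean = mean(skipmissing(y_clean))
```

Calling `mean(y_clean)` directly would return `missing`, which is why the `skipmissing` wrapper is needed for statistics.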

Detect `missing`s when creating the `DataFrame`

Here it depends on the package you use to load the data. For instance, if you use CSV.jl, you can pass the `null="-9999"` keyword argument to `CSV.read`. In more complex cases, you can use the `transforms` keyword argument, e.g. with an adjusted version of the `val2missing` function proposed by Dan in his answer.
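The idea of converting the sentinel while the data is being read can be sketched without any particular loader. This is only an illustration under stated assumptions: `sentinel_to_missing` is a hypothetical helper (it is not Dan's `val2missing` and not part of CSV.jl), and the two-column text below is invented sample data, not the GHCN file format.

```julia
# Hypothetical helper: parse one text field, mapping the sentinel to missing
function sentinel_to_missing(field::AbstractString; sentinel::AbstractString="-9999")
    s = strip(field)
    return s == sentinel ? missing : parse(Float64, s)
end

# Invented sample: one "tmax,prcp" record per line
raw = """
21.5,0.0
-9999,3.2
19.0,-9999
"""

# Split the text into lines, then each line into its two fields
rows = [split(line, ',') for line in split(strip(raw), '\n')]

# Build each column with the sentinel already converted
tmax = [sentinel_to_missing(r[1]) for r in rows]
prcp = [sentinel_to_missing(r[2]) for r in rows]
```

The resulting vectors can then be used as columns when constructing the data frame. Note that recent CSV.jl releases name the corresponding keyword `missingstring` rather than `null`, so check the documentation of the version you have installed.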
