pp492

Reputation: 551

transform function on all columns of dataframe

I have a dataframe df and I am trying to apply a function to each of the cells. According to the documentation I should use the transform function.

The function should be applied to every column, so I use [:] as a selector for all columns:

transform(
    df, [:] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:]
)

but it yields an exception

ArgumentError: Unrecognized column selector: Colon() => (DataFrames.ByRow{Main.workspace293.var"#1#2"}(Main.workspace293.var"#1#2"()) => Colon())

although when I am using a single column, it works fine

transform(
    df, [:K0] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:K0]
)

Upvotes: 2

Views: 867

Answers (1)

Bogumił Kamiński

Reputation: 69819

The simplest way to do it is to use broadcasting:

julia> df = DataFrame(2*rand(4,3), [:x1, :x2, :x3])
4×3 DataFrame
│ Row │ x1        │ x2       │ x3       │
│     │ Float64   │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┤
│ 1   │ 0.945879  │ 1.59742  │ 0.882428 │
│ 2   │ 0.0963367 │ 0.400404 │ 0.599865 │
│ 3   │ 1.23356   │ 0.807691 │ 0.547917 │
│ 4   │ 0.756098  │ 0.595673 │ 0.29678  │

julia> @. ifelse(df > 1, df, 0.0)
4×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 1.59742 │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 1.23356 │ 0.0     │ 0.0     │
│ 4   │ 0.0     │ 0.0     │ 0.0     │

You can also use transform for this if you prefer:

julia> transform(df, names(df) .=> ByRow(x -> ifelse(x>1, x, 0.0)) .=> names(df))
4×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 1.59742 │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 1.23356 │ 0.0     │ 0.0     │
│ 4   │ 0.0     │ 0.0     │ 0.0     │
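
As a rough illustration of why names(df) works where [:] does not: [:] is a one-element vector holding Colon(), so broadcasting .=> over it builds a single pair Colon() => (f => Colon()), which is the selector quoted in the error message. names(df) has one entry per column, so the same broadcast builds one pair per column. The sketch below assumes a recent DataFrames.jl; the helper name clip and the small fixed table are made up for the example, and the renamecols keyword and mapcols are shown only as possible alternatives.

```julia
using DataFrames

# Small fixed table so the result is easy to check by eye (illustrative data).
df = DataFrame(x1 = [0.5, 1.5], x2 = [2.0, 0.25])

# Hypothetical helper: keep values above 1, otherwise use 0.0.
clip(x) = ifelse(x > 1, x, 0.0)

# Broadcasting over [:] produces one nested pair whose source is Colon(),
# the same shape as the selector reported in the ArgumentError.
bad = [:] .=> ByRow(clip) .=> [:]

# names(df) has one entry per column, so `.=>` yields one pair per column.
# renamecols = false keeps the original names, so no target names are needed.
out1 = transform(df, names(df) .=> ByRow(clip); renamecols = false)

# mapcols hands each whole column to the function, so broadcast inside it.
out2 = mapcols(col -> clip.(col), df)
```

Both out1 and out2 should hold the same values as the broadcasting version; which form reads best is mostly a matter of taste.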

Also, compared with the linked pandas solution, DataFrames.jl seems faster in this case:

julia> df = DataFrame(2*rand(2,3), [:x1, :x2, :x3])
2×3 DataFrame
 Row │ x1       x2       x3       
     │ Float64  Float64  Float64  
─────┼────────────────────────────
   1 │ 1.48781  1.20332  1.08071
   2 │ 1.55462  1.66393  0.363993

julia> using BenchmarkTools

julia> @btime @. ifelse($df > 1, $df, 0.0)
  6.252 μs (58 allocations: 3.89 KiB)
2×3 DataFrame
 Row │ x1       x2       x3      
     │ Float64  Float64  Float64 
─────┼───────────────────────────
   1 │ 1.48781  1.20332  1.08071
   2 │ 1.55462  1.66393  0.0

(In pandas, for a 2×3 data frame, the timings ranged from 163 µs to 2.26 ms.)

Upvotes: 7
