Applying group-specific functions to a Julia DataFrame

Question

I would like to apply one of two functions to a column of a DataFrame, choosing the function according to a category (one function per category) that is given in another column.

My use case is converting the JD times of a list of observations taken at two separate locations to BJD (Barycentric Julian Date), a conversion that depends on the location.

For example, if I had a table like:

julia> using DataFrames

julia> df = DataFrame(:t => 1:5, :location => rand(["locA", "locB"], 5))
5×2 DataFrame
 Row │ t      location 
     │ Int64  String   
─────┼─────────────────
   1 │     1  locB
   2 │     2  locA
   3 │     3  locA
   4 │     4  locB
   5 │     5  locA

where the BJD time at locA = t^2 and the time at locB = -t^2 (just for clarity), how could I add this new column?

One way I have tried is to apply a function built on the ternary operator:

julia> loc_to_BJD(t, loc) = loc == "locA" ? t^2 : -t^2
loc_to_BJD (generic function with 1 method)

julia> df.loc_to_BJD = loc_to_BJD.(df.t, df.location); df
5×3 DataFrame
 Row │ t      location  loc_to_BJD 
     │ Int64  String    Int64      
─────┼─────────────────────────────
   1 │     1  locB              -1
   2 │     2  locA               4
   3 │     3  locA               9
   4 │     4  locB             -16
   5 │     5  locA              25

but running a check on the category each time like this seems awfully inefficient (unless the compiler is doing something clever behind the scenes?)

Would using groups make more sense in this case? I run into the problem of trying to add a column to a SubDataFrame when trying this though, so I do not think that I am approaching this from the right direction.

Nils Gudat · Accepted Answer

I'm not sure why you think this would be inefficient, as it seems to me that someone's gotta go and check on those locations to see if they're A or not, but IANACS. A quick benchmark on three ways of doing this that came to mind:

julia> using BenchmarkTools, DataFrames

julia> function add1(d)
           loc_to_BJD(t, loc) = loc == "locA" ? t^2 : -t^2
           d[!, "loc_to_BJD"] = loc_to_BJD.(d.t, d.location)
           return d
       end
add1 (generic function with 1 method)

julia> function add2(d)
           d[!, "loc_to_BJD"] = ifelse.(d.location .== "locA", d.t .^ 2, (-1) .* d.t .^ 2)
       end
add2 (generic function with 1 method)

julia> function add3(d)
           transform!(d, [:t, :location] => ByRow((t, l) -> t == "locA" ? t^2 : -t^2) => :loc_to_BJD)
       end
add3 (generic function with 1 method)

julia> @btime add1($df);
  10.627 ms (6 allocations: 7.63 MiB)

julia> @btime add2($df);
  11.291 ms (19 allocations: 7.63 MiB)

julia> @btime add3($df);
  1.965 ms (163 allocations: 7.64 MiB)

I was surprised to see transform! doing so much better here than the simple ifelse/ternary operator versions. I feel like I've done something wrong in the benchmark but can't see what it is. Maybe check whether you find the same in your real-world use case, but I'm sure Bogumil will be along shortly to explain what's going on :)
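One way to chase down a surprise like this is to confirm that the variants produce the same column before comparing their timings. A minimal sketch, assuming a small deterministic table (the table actually benchmarked is not shown above, and the names `expected`, `got` and `same` are only illustrative):

```julia
using DataFrames

# Assumed test data: alternating locations so both branches are exercised.
n = 1_000
df = DataFrame(t = 1:n, location = repeat(["locA", "locB"], n ÷ 2))

# Reference column from the plain ternary version.
expected = [loc == "locA" ? t^2 : -t^2 for (t, loc) in zip(df.t, df.location)]

# Column from the ByRow version exactly as written in add3 above.
# `transform` (no bang) works on a copy, so df itself is left untouched.
got = transform(df,
    [:t, :location] => ByRow((t, l) -> t == "locA" ? t^2 : -t^2) => :loc_to_BJD
).loc_to_BJD

# If the two variants were equivalent, this would be true.
same = got == expected
```

If `same` comes out false, the timings are not comparing like with like. As an aside, the reported 7.63 MiB per call is consistent with one new column of about 10^6 Int64 values, so the benchmarked df was presumably much larger than the 5-row example.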

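EDIT: With thanks to Bogumil's comment below, here's the correct add3. The version above compared t, rather than l, with "locA"; that comparison is always false, so every row got -t^2 and no string comparison was ever performed: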

julia> function add3(d)
           transform!(d, [:t, :location] => ByRow((t, l) -> l == "locA" ? t^2 : -t^2) => :loc_to_BJD)
       end
add3 (generic function with 1 method)

julia> @btime add3($df);
  11.177 ms (163 allocations: 7.64 MiB)

So no mystery after all!
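On the question of whether groups make more sense: a grouped transform! writes the new column back to the parent table, which sidesteps adding a column to a SubDataFrame. Below is a minimal sketch of that idea, not a claim that it is faster. The `converters` dictionary is a hypothetical lookup table introduced here for illustration, and the locations are fixed to match the example table in the question.

```julia
using DataFrames

# Same layout as the example table in the question, with fixed locations.
df = DataFrame(t = 1:5, location = ["locB", "locA", "locA", "locB", "locA"])

# Hypothetical lookup table: one conversion function per location.
converters = Dict("locA" => t -> t^2, "locB" => t -> -t^2)

# Within a group every row shares the same location, so the function is
# looked up once per group and then broadcast over that group's t values.
# transform! on a GroupedDataFrame updates the parent df in place and
# keeps the original row order.
transform!(groupby(df, :location),
    [:location, :t] => ((loc, t) -> converters[first(loc)].(t)) => :loc_to_BJD)
```

The category is then checked once per group instead of once per row, but grouping has its own cost, so it is worth benchmarking against the row-wise versions on the real data.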
