Reputation: 160
I would like to apply one of two possible functions to a column of a DataFrame, based on its category (one for each function), which is specified in another column.
The use case I have for this is converting the JD times of a list of observations from two separate locations to BJD (Barycentric Julian Date), which is dependent on that location.
For example, if I had a table like:
julia> using DataFrames
julia> df = DataFrame(:t => 1:5, :location => rand(["locA", "locB"], 5))
5×2 DataFrame
 Row │ t      location
     │ Int64  String
─────┼─────────────────
   1 │     1  locB
   2 │     2  locA
   3 │     3  locA
   4 │     4  locB
   5 │     5  locA
where the BJD time at locA = t^2 and the time at locB = -t^2 (just for clarity), how could I add this new column?
One way I have tried is to just apply a ternary function:
julia> loc_to_BJD(t, loc) = loc == "locA" ? t^2 : -t^2
loc_to_BJD (generic function with 1 method)
julia> df.loc_to_BJD = loc_to_BJD.(df.t, df.location); df
5×3 DataFrame
 Row │ t      location  loc_to_BJD
     │ Int64  String    Int64
─────┼─────────────────────────────
   1 │     1  locB              -1
   2 │     2  locA               4
   3 │     3  locA               9
   4 │     4  locB             -16
   5 │     5  locA              25
but running a check on the category each time like this seems awfully inefficient (unless the compiler is doing something clever behind the scenes?)
Would using groups make more sense in this case? I run into the problem of trying to add a column to a SubDataFrame when trying this though, so I do not think that I am approaching this from the right direction.
Upvotes: 2
Views: 495
Reputation: 13800
I'm not sure why you think this would be inefficient, as it seems to me that someone's gotta go and check on those locations to see if they're A or not, but IANACS. A quick benchmark on three ways of doing this that came to mind:
julia> using BenchmarkTools, DataFrames
julia> function add1(d)
           loc_to_BJD(t, loc) = loc == "locA" ? t^2 : -t^2
           d[!, "loc_to_BJD"] = loc_to_BJD.(d.t, d.location)
           return d
       end
add1 (generic function with 1 method)
julia> function add2(d)
           d[!, "loc_to_BJD"] = ifelse.(d.location .== "locA", d.t .^ 2, (-1) .* d.t .^ 2)
       end
add2 (generic function with 1 method)
julia> function add3(d)
           transform!(d, [:t, :location] => ByRow((t, l) -> t == "locA" ? t^2 : -t^2) => :loc_to_BJD)
       end
add3 (generic function with 1 method)
julia> @btime add1($df);
10.627 ms (6 allocations: 7.63 MiB)
julia> @btime add2($df);
11.291 ms (19 allocations: 7.63 MiB)
julia> @btime add3($df);
1.965 ms (163 allocations: 7.64 MiB)
I was surprised to see transform! doing so much better here than the simple ifelse/ternary operator versions. I feel like I've done something wrong in the benchmark but can't see what it is. Maybe check whether you find the same in your real-world use case, but I'm sure Bogumil will be along shortly to explain what's going on :)
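One way to rule out a mistake like that is to check, on a tiny frame, that every candidate produces the same column before comparing timings. The following is only a sketch: the frame `small`, the hand-written `expected` vector and the helper `matches_expected` are made up for illustration, while `add1` and `add3` are copied unchanged from above.

```julia
using DataFrames

# Small deterministic frame with the same layout as the example in the question.
small = DataFrame(t = 1:5, location = ["locB", "locA", "locA", "locB", "locA"])

# Reference values written out by hand: t^2 at locA, -t^2 at locB.
expected = [-1, 4, 9, -16, 25]

# `add1` and `add3` exactly as benchmarked above.
function add1(d)
    loc_to_BJD(t, loc) = loc == "locA" ? t^2 : -t^2
    d[!, "loc_to_BJD"] = loc_to_BJD.(d.t, d.location)
    return d
end

function add3(d)
    transform!(d, [:t, :location] => ByRow((t, l) -> t == "locA" ? t^2 : -t^2) => :loc_to_BJD)
end

# Hypothetical helper: run `f` on a fresh copy and compare the new column
# with the reference. The candidates mutate their argument in place, so the
# column is read from the copy rather than from the return value.
function matches_expected(f)
    d = copy(small)
    f(d)
    return d.loc_to_BJD == expected
end

println("add1 matches: ", matches_expected(add1))
println("add3 matches: ", matches_expected(add3))
```

Any candidate that reports `false` here is not computing the same thing as the others, so its timing says nothing about the relative cost of the approach.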
EDIT: With thanks to Bogumil's comment below, here's the correct add3:
julia> function add3(d)
           transform!(d, [:t, :location] => ByRow((t, l) -> l == "locA" ? t^2 : -t^2) => :loc_to_BJD)
       end
add3 (generic function with 1 method)
julia> @btime add3($df);
11.177 ms (163 allocations: 7.64 MiB)
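So no mystery after all! The first add3 compared t (an integer) with "locA", which is always false, so it skipped the string comparison entirely and returned -t^2 for every row.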
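On the question of whether handling each category as a group would help: a rough sketch of that idea, without going through SubDataFrames, is to build one boolean mask per location and apply that location's function to the matching rows only. The names `add_by_category!` and `converters` are hypothetical, and the two functions are the placeholder conversions from the question rather than real BJD corrections.

```julia
using DataFrames

# Hypothetical helper: `converters` maps each location name to the function
# that should be applied to `t` for rows at that location.
function add_by_category!(d, converters)
    # Assumes every conversion returns the same element type as `d.t`, and
    # that every location in `d` has an entry in `converters`; rows without
    # one would be left uninitialised.
    out = similar(d.t)
    for (loc, f) in converters
        mask = d.location .== loc      # one scan of the location column per category
        out[mask] = f.(d.t[mask])      # apply this category's function to its rows
    end
    d[!, :loc_to_BJD] = out
    return d
end

df = DataFrame(t = 1:5, location = ["locB", "locA", "locA", "locB", "locA"])

# Placeholder conversions from the question.
converters = Dict("locA" => t -> t^2, "locB" => t -> -t^2)

add_by_category!(df, converters)
```

This still compares every location string once per category, so there is no particular reason to expect it to beat the row-wise versions above. It may be worth benchmarking on the real data if the per-location conversion is expensive or benefits from being called on a whole vector at once.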
Upvotes: 3