Ajar

Reputation: 1826

Julia: Create DataFrame column from expression?

Given this:

dict = Dict(("y" => ":x / 2"))

df = DataFrame(x = [1, 2, 3, 4])

df
4×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
│ 4   │ 4     │

I want to make this:

4×2 DataFrame
│ Row │ x     │ y       │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 0.5     │
│ 2   │ 2     │ 1.0     │
│ 3   │ 3     │ 1.5     │
│ 4   │ 4     │ 2.0     │

This seems like a perfect application for DataFramesMeta, either @with or @eachrow, but I haven't been able to get my expression to evaluate as expected in an environment where :x exists.

Basically, I want to be able to iterate over (k, v) pairs in dict and create one new column for each Symbol(k) with corresponding values eval(Meta.parse(v)), or something along those lines, where the evaluation occurs such that Symbols like :x exist at the time of evaluation.

I didn't expect this to work, and it doesn't:

[df[Symbol(k)] = eval(Meta.parse(v)) for (k, v) in dict]

ERROR: MethodError: no method matching /(::Symbol, ::Int64)

But this illustrates the problem: I need the expressions to be evaluated in an environment where the symbols they contain exist.

However, moving it inside a @with doesn't work:

using DataFramesMeta

@with(df, [eval(Meta.parse(v)) for (k, v) in dict])

ERROR: MethodError: no method matching /(::Symbol, ::Int64)

Using @eachrow fails the same way:

using DataFramesMeta

@eachrow df begin
           for (k, v) in dict
               @newcol tmp::Vector{Float32}
               tmp = eval(Meta.parse(v))
           end
       end

ERROR: MethodError: no method matching /(::Symbol, ::Int64)

I'm guessing I'm unclear on some key element of how DataFramesMeta creates an environment within a DataFrame. I also don't necessarily have to use DataFramesMeta for this; any reasonably concise option will work, since I can encapsulate it in a package function.

Note: I control the format of the strings to be parsed into expressions, but I want to avoid complexity such as specifying the name of the DataFrame object in the string, or broadcasting every operation. I want the expression syntax in the initial string to be reasonably clear to non-Julia programmers.

UPDATE: I tried all three solutions in the comments on this question, and they have a problem: they don't work inside functions.

dict = Dict(("y" => ":x / 2"))

data = DataFrame(x = [1, 2, 3, 4])


function transform_from_dict(df, dict)

    new = eval(Meta.parse("@transform(df, " * join(join.(collect(dict), " = "), ", ") * ")"))

    return new

end

transform_from_dict(data, dict)

ERROR: UndefVarError: df not defined

Or:

function transform_from_dict!(df, dict)

    [df[!, Symbol(k)] = eval(:(@with(df, $(Meta.parse(v))))) for (k, v) in dict]

    return nothing

end

transform_from_dict!(data, dict)

ERROR: UndefVarError: df not defined

Upvotes: 3

Views: 327

Answers (2)

questionto42

Reputation: 9512

I worked on this answer in parallel with @Ajar; nothing is copied from that answer, and I did not know about it at the time. I was totally new to Julia, so I installed it locally (I thought the online compilers did not support DataFrame at all). Later I understood that these packages must be loaded at the start anyway, whether online or offline. I have added the package information that beginners might need.

using Pkg 
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")

using DataFrames
using DataFramesMeta 
dict = Dict(("y" => ":x / 2"))
df = DataFrame(x = [1, 2, 3, 4])

The @with solution:

julia> function transform_from_dict!(k, v)
           global df
           df[!, Symbol(k)] = eval(:(@with(df, $(Meta.parse(v)))))
           return nothing
       end
transform_from_dict! (generic function with 2 methods)
julia> [transform_from_dict!(k, v) for (k, v) in dict]
1-element Array{Nothing,1}:
 nothing
julia> df
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      0.5
   2 │     2      1.0
   3 │     3      1.5
   4 │     4      2.0

The @transform solution:

julia> function transform_from_dict(df, dict)
           global new
           new = eval(Meta.parse("@transform(df, " * join(join.(collect(dict), " = "), ", ") * ")"))

           return new

       end
transform_from_dict (generic function with 1 method)
julia> transform_from_dict(df, dict)
4×2 DataFrame
 Row │ x      y
     │ Int64  Float64
─────┼────────────────
   1 │     1      0.5
   2 │     2      1.0
   3 │     3      1.5
   4 │     4      2.0

Thanks go to the other commenters, whose essential ideas are listed in @Ajar's answer.

Upvotes: 1

Ajar

Reputation: 1826

OK, combining answers from all of the commenters works!

using DataFrames
using DataFramesMeta

dict = Dict(("y" => ":x / 2"))

data = DataFrame(x = [1, 2, 3, 4])

@张实唯's approach using @with:

# using @with
function transform_from_dict1(df, dict)

    global df

    [df[!, Symbol(k)] = eval(:(@with(df, $(Meta.parse(v))))) for (k, v) in dict]

    return df

end

transform_from_dict1(data, dict)
# 4×2 DataFrame
# │ Row │ x     │ y       │
# │     │ Int64 │ Float64 │
# ├─────┼───────┼─────────┤
# │ 1   │ 1     │ 0.5     │
# │ 2   │ 2     │ 1.0     │
# │ 3   │ 3     │ 1.5     │
# │ 4   │ 4     │ 2.0     │

And @Bogumił Kamiński's approach using @transform:

# using @transform
function transform_from_dict2(df, dict)

    global df

    new_df = eval(Meta.parse("@transform(df, " * join(join.(collect(dict), " = "), ", ") * ")"))

    return new_df

end

transform_from_dict2(data, dict)
# 4×2 DataFrame
# │ Row │ x     │ y       │
# │     │ Int64 │ Float64 │
# ├─────┼───────┼─────────┤
# │ 1   │ 1     │ 0.5     │
# │ 2   │ 2     │ 1.0     │
# │ 3   │ 3     │ 1.5     │
# │ 4   │ 4     │ 2.0     │

Both incorporate the fix from @Lorenz using global.

Note that the second form allocates about 26x as much memory as the first, likely due to the creation of a second DataFrame:

julia> @allocated transform_from_dict1(data, dict)
853948

julia> @allocated transform_from_dict2(data, dict)
22009111

I also think the first form is a little clearer, so that's what I'm using internally.
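For context, the global workaround is needed because eval always runs in the module's global scope, so the evaluated expression cannot see a function's local df. A possible alternative, sketched below, is to turn each string into a one-argument function and pass the table in explicitly. The helper names (replace_columns, compile_column, add_columns) are hypothetical, and a NamedTuple of vectors stands in for the DataFrame so that the sketch runs without any packages.

```julia
# Replace every quoted symbol (e.g. :x) in a parsed expression with a
# property lookup on the table argument named by `tbl`.
replace_columns(ex, tbl::Symbol) = ex                       # literals, operators
replace_columns(ex::QuoteNode, tbl::Symbol) =
    ex.value isa Symbol ? :(getproperty($tbl, $ex)) : ex    # :x -> tbl.x
replace_columns(ex::Expr, tbl::Symbol) =
    Expr(ex.head, (replace_columns(a, tbl) for a in ex.args)...)

# Build an anonymous function `tbl -> <expression>` from the string.
# Only the function definition goes through eval; the data is passed
# in as an argument, so no global variable is required.
function compile_column(str::AbstractString)
    body = replace_columns(Meta.parse(str), :tbl)
    return eval(:(tbl -> $body))
end

# Add one column per (name => expression string) pair.
function add_columns(tbl::NamedTuple, dict)
    for (k, v) in dict
        f = compile_column(v)
        # f was created by eval after this function started running,
        # so it has to be called through invokelatest.
        col = Base.invokelatest(f, tbl)
        tbl = merge(tbl, NamedTuple{(Symbol(k),)}((col,)))
    end
    return tbl
end

data = (x = [1, 2, 3, 4],)
result = add_columns(data, Dict("y" => ":x / 2"))
# result.y == [0.5, 1.0, 1.5, 2.0]
```

Since df.x on a DataFrame is also a property lookup, the same compiled function should work there; I would assume the loop body becomes df[!, Symbol(k)] = Base.invokelatest(f, df), but I have not tested that variant.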

Note that you may need to broadcast logical operators if you have those in your transforms, and that as usual you'll need to handle any missing data issues up front.
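As a small illustration of that last point, here is what broadcasting and missing handling look like on plain vectors (the variable names are made up for the example; the same dotted operators would appear in the expression strings):

```julia
x = [1, 2, 3, 4]

# Element-wise comparisons and logical AND need the dotted forms;
# `x > 1 && x < 4` would fail on a vector.
in_range = (x .> 1) .& (x .< 4)
# in_range == [false, true, true, false]

# Comparisons propagate `missing`, so pick a default up front.
z = [1, missing, 3, 4]
above = coalesce.(z .> 2, false)
# above == [false, false, true, true]
```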

Upvotes: 1
