psych0groov3
psych0groov3

Reputation: 701

Creating additional columns using DataFrames based on condition on existing columns

I have a DataFrame with 3 columns: Col1, Col2, Col3. Each column contains an array of strings that has one of these values: "V1", "V2", "V3". For example, an array can look like this: ["V2", "V3"]

I want to create three new columns called: :V1, :V2, :V3 with values "Yes" in :V1 if :V1 appears in ANY of the arrays in columns Col1, Col2, Col3. Same for V2 and V3.

Here is my current code which works but got there by trial and error and wonder if there is a simpler solution (perhaps a one liner):

@chain df begin
transform(Cols(r"^Col") => ByRow((x...) -> any(any(y -> y == "V1", arr) for arr in x) ? "Yes" : "No") => :V1)
transform(Cols(r"^Col") => ByRow((x...) -> any(any(y -> y == "V2", arr) for arr in x) ? "Yes" : "No") => :V2)
transform(Cols(r"^Col") => ByRow((x...) -> any(any(y -> y == "V3", arr) for arr in x) ? "Yes" : "No") => :V3)
end

Is there a simpler approach and a way to make this one line as the logic is repeated.

Can I use in function to check if a value is contained in the array? How to implement a solution using in?

Upvotes: 1

Views: 43

Answers (2)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69899

Alternative one-liner:

transform(df, [AsTable(Cols(r"^Col")) => ByRow(x -> v in x ? "Yes" : "No") => v for v in string.("V", 1:3)])

Upvotes: 0

Andre Wildberg
Andre Wildberg

Reputation: 19088

A naive approach would be looping through the elements ["V1", "V2", "V3"] and comparing them with each element per row, then replacing the boolean vectors with the desired strings. Finally adding the vectors to the DataFrame.

using DataFrames

for ele in ["V1", "V2", "V3"]
  vec = []
  for row in eachrow(df)
    push!(vec, any(Vector(row) .== ele))
  end
  df[:,ele] = [i ? "Yes" : "No" for i in vec]
end

result

df
3×6 DataFrame
 Row │ Col1    Col2    Col3    V1      V2      V3     
     │ String  String  String  String  String  String 
─────┼────────────────────────────────────────────────
   1 │ V1      V2      V1      Yes     Yes     No
   2 │ V3      V1      V1      Yes     No      Yes
   3 │ V2      V3      V3      No      Yes     Yes
Data
df = DataFrame([["V1", "V3", "V2"],
                ["V2", "V1", "V3"],
                ["V1", "V1", "V3"]], 
                ["Col1", "Col2", "Col3"])

Upvotes: 2

Related Questions