Checking elements in multiple columns of Dataframe in Julia

Question

I have a question about using conditions inside a loop while manipulating a DataFrame.

For example, I have a DataFrame

I am trying to write a loop with a condition that checks two columns (a and b) at a time: if the value i appears in either or both columns, it should take the corresponding values from column c and store them in an array.

I can then use that array for statistical operations, such as finding its mean.

I have written a simplified code snippet for this task:

for i in 1:5
  result1 = Float64[]
  result2 = Float64[]
  if (df[:, :a] = i) 
      push!(result1, df[:, :c])
  elseif (df[:, :b] = i)
      push!(result2, df[:, :c])
  end

  unique!(result1)
  unique!(result2)

  result = vcat(result1, result2)

  global mean_val = mean(result)
end

Here, i ranges from 1 to 5. For each value, both columns a and b are checked for it; if the value exists, the value in column c should be pushed to the respective result array.

I have tried some other suggestions from the community, such as:

Code Example 1:


for i in 1:5
  mean_val = mean(df[:, :c] for i in ("a", "b")
end

Code Example 2:

for i in 1:5
  df.row = axes(df, 1)
  mean_val = mean((filter(x->x[:a] == i || x[:b] == i ,df))[:c])
end

However, these do not work and do not return the desired output.

Please advise on the mistake in my code. Also, please suggest any documentation that explains how to combine multiple conditions in a statement and how to access DataFrame elements for other operations in Julia.

Thank you in advance

François Févotte · Accepted Answer

A first way to do what (I think) you want to achieve would be to use the indexing syntax to take a subset of your dataframe:

julia> using DataFrames
julia> df = DataFrame(a = rand(1:5, 10), b = rand(1:5, 10), c = rand(1:100, 10))
10×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2     25
   2 │     5      4     72
   3 │     4      3     37
   4 │     4      3     46
   5 │     3      2     31
   6 │     3      5     43
   7 │     5      1     35
   8 │     5      2     54
   9 │     1      1     64
  10 │     1      4     57

julia> idx = (df.a .== 3) .| (df.b .== 3)
10-element BitArray{1}:
 0
 0
 1
 1
 1
 1
 0
 0
 0
 0

julia> filtered_c = df[idx, :c]
4-element Array{Int64,1}:
 37
 46
 31
 43

You can then compute any statistics you want on the resulting filtered values:

julia> using Statistics

julia> mean(filtered_c)
39.25
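
To get one mean per value of i, as the loop in the question intends, the mask can be rebuilt inside a loop. The following is a minimal sketch rather than a definitive implementation: the data is the example table above typed in by hand so that the results are reproducible, `means_by_value` is a hypothetical name, and storing NaN when no row matches is an assumption about the desired behaviour. Note that `.==` compares element-wise, whereas the `=` used in the question is an assignment.

```julia
using DataFrames
using Statistics

# Same rows as the example table above, entered by hand (no random data)
df = DataFrame(a = [1, 5, 4, 4, 3, 3, 5, 5, 1, 1],
               b = [2, 4, 3, 3, 2, 5, 1, 2, 1, 4],
               c = [25, 72, 37, 46, 31, 43, 35, 54, 64, 57])

# Hypothetical container: maps each value i to the mean of the matching c values
means_by_value = Dict{Int,Float64}()

for i in 1:5
    # Element-wise comparison on both columns, combined with element-wise OR
    idx = (df.a .== i) .| (df.b .== i)
    vals = df[idx, :c]
    # Assumption: record NaN when no row contains i in either column
    means_by_value[i] = isempty(vals) ? NaN : mean(vals)
end
```

After the loop, `means_by_value[3]` holds the same 39.25 computed above, and the other keys hold the means for the remaining values of i.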

Another way of doing the same thing would rely on the use of filter to filter the rows you want to keep:

julia> filtered_df = filter(row -> (row.a==3 || row.b==3), df)
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

# This way of writing things is equivalent to the previous one, but
# might be more readable in cases where the condition you're checking
# is more complex
julia> filtered_df = filter(df) do row
           row.a == 3 || row.b == 3
       end
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

julia> mean(filtered_df.c)
39.25
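
The filter-based version can also be run once per value of i. The sketch below is one possible way to do it, not the only one: it uses the `filter(cols => predicate, df)` form from DataFrames.jl, which passes only the listed columns to the predicate, and it assumes the same hand-entered example data as above. The name `filter_means` is hypothetical, and the call to `unique` is only there to mirror the `unique!` in the question; drop it if duplicate values of c should count.

```julia
using DataFrames
using Statistics

# Same rows as the example table above, entered by hand (no random data)
df = DataFrame(a = [1, 5, 4, 4, 3, 3, 5, 5, 1, 1],
               b = [2, 4, 3, 3, 2, 5, 1, 2, 1, 4],
               c = [25, 72, 37, 46, 31, 43, 35, 54, 64, 57])

# Hypothetical result vector: filter_means[i] is the mean for value i
filter_means = Float64[]

for i in 1:5
    # Keep the rows where column a or column b equals i (scalar ==, scalar ||)
    sub = filter([:a, :b] => (a, b) -> a == i || b == i, df)
    # Mirrors the unique! call in the question; optional
    vals = unique(sub.c)
    # Assumption: record NaN when no row matches
    push!(filter_means, isempty(vals) ? NaN : mean(vals))
end
```

Inside the predicate the comparison works on one row at a time, so the plain `==` and `||` operators are appropriate. The dotted forms `.==` and `.|` are only needed when comparing whole columns, as in the indexing approach.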
