Mohammad Saad
Mohammad Saad

Reputation: 2005

Checking elements in multiple columns of Dataframe in Julia

I have a question regarding the use of conditions in any loop while manipulating on DataFrame.

For example, I have a DataFrame

df:

a b c

1 2 5
3 4 3
2 1 7
6 3 6
5 1 9

I am trying to write a loop with a condition which checks on two cols (a and b) at a time and if the value i is available in either or both column then it should take the values from column c and store it in an array.

Using which I can later perform the statistical operations like finding mean of the array.

I have written a simplified code snippet for this task:

for i in 1:5
  result1 = Float64[]
  result2 = Float64[]
  if (df[:, :a] = i) 
      push!(result1, df[:, :c])
  elseif (df[:, :b] = i)
      push!(result2, df[:, :c])
  end

  unique!(result1)
  unique!(result2)

  result = vcat(result1, result2)

  global mean_val = mean(result)
end

Here, the i value will range from 1 to 5 and for each value both the columns a and b will be checked for its existence, if the value exist then value in column c should be pushed to the respected result array.

I have tried using some other suggestions from community like:

Code Example 1:


for i in 1:5
  mean_val = mean(df[:, :c] for i in ("a", "b")
end

Code Example 2:

for i in 1:5
  df.row = axes(df, 1)
  mean_val = mean((filter(x->x[:a] == i || x[:b] == i ,df))[:c])
end

However these do not work and return a desired output.

Please advice on my mistake in the code. Also, please do suggest if there is any document which explains about implementing multiple conditions in a statement, and accessing dataframe elements for any other operations in julia.

Thank you in advance

Upvotes: 2

Views: 1272

Answers (2)

A first way to do what (I think) you want to achieve would be to use the indexing syntax to take a subset of your dataframe:

julia> using DataFrames
julia> df = DataFrame(a = rand(1:5, 10), b = rand(1:5, 10), c = rand(1:100, 10))
10×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2     25
   2 │     5      4     72
   3 │     4      3     37
   4 │     4      3     46
   5 │     3      2     31
   6 │     3      5     43
   7 │     5      1     35
   8 │     5      2     54
   9 │     1      1     64
  10 │     1      4     57
julia> idx = (df.a .== 3) .| (df.b .== 3)
10-element BitArray{1}:
 0
 0
 1
 1
 1
 1
 0
 0
 0
 0

julia> filtered_c = df[idx, :c]
4-element Array{Int64,1}:
 37
 46
 31
 43

You can then compute any statistics you want on the resulting filtered values:

julia> using Statistics

julia> mean(filtered_c)
39.25

Another way of doing the same thing would rely on the use of filter to filter the rows you want to keep:

julia> filtered_df = filter(row -> (row.a==3 || row.b==3), df)
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

# This way of writing things is equivalent to the previous one, but
# might be more readable in cases where the condition you're checking
# is more complex
julia> filtered_df = filter(df) do row
           row.a == 3 || row.b == 3
       end
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     4      3     37
   2 │     4      3     46
   3 │     3      2     31
   4 │     3      5     43

julia> mean(filtered_df.c)
39.25

Upvotes: 4

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69949

As a small efficiency note to an excellent answer by François Févotte, it is faster to do:

julia> filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
4×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      5      1
   2 │     3      5      9
   3 │     4      3     74
   4 │     4      3     63

if you have a very large data frame. There are two differences here:

  1. I use a [:a, :b] => (a,b) -> a == 3 || b == 3 synax, which is type stable (so it will iterate rows faster);
  2. I use view=true to produce a view of a source data frame, which allocates much less (it might matter for very large data frames);

Here is a small example of different row subsetting options you have on a larger data frame:

julia> df = DataFrame(a=rand(1:3, 10^8), b=rand(1:3, 10^8), c=rand(10^8));

julia> function test()
           @time filter(row -> (row.a==3 || row.b==3), df)
           @time df[(df.a .== 3) .| (df.b .== 3), :]
           @time @view df[(df.a .== 3) .| (df.b .== 3), :]
           @time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df)
           @time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
           return nothing
       end
test (generic function with 1 method)

julia> test()
 19.912672 seconds (333.67 M allocations: 6.652 GiB, 5.71% gc time, 0.41% compilation time)
  1.152460 seconds (29 allocations: 1.667 GiB, 14.88% gc time)
  0.515334 seconds (15 allocations: 435.807 MiB, 40.49% gc time)
  1.066756 seconds (412.82 k allocations: 1.689 GiB, 5.56% gc time, 12.54% compilation time)
  0.646710 seconds (382.98 k allocations: 455.835 MiB, 31.27% gc time, 23.02% compilation time)

julia> test()
 18.194791 seconds (333.34 M allocations: 6.635 GiB, 4.87% gc time)
  1.018816 seconds (29 allocations: 1.667 GiB, 15.34% gc time)
  0.469027 seconds (15 allocations: 435.807 MiB, 41.19% gc time)
  0.912572 seconds (30 allocations: 1.667 GiB, 5.32% gc time)
  0.480374 seconds (16 allocations: 435.807 MiB, 41.15% gc time)

Upvotes: 3

Related Questions