Reputation: 2005
I have a question regarding the use of conditions in any loop while manipulating on DataFrame.
For example, I have a DataFrame
df:
a b c
1 2 5
3 4 3
2 1 7
6 3 6
5 1 9
I am trying to write a loop with a condition which checks on two cols (a and b
) at a time and if the value i
is available in either or both column then it should take the values from column c
and store it in an array.
Using which I can later perform the statistical operations like finding mean of the array.
I have written a simplified code snippet for this task:
for i in 1:5
result1 = Float64[]
result2 = Float64[]
if (df[:, :a] = i)
push!(result1, df[:, :c])
elseif (df[:, :b] = i)
push!(result2, df[:, :c])
end
unique!(result1)
unique!(result2)
result = vcat(result1, result2)
global mean_val = mean(result)
end
Here, the i
value will range from 1 to 5 and for each value both the columns a and b
will be checked for its existence, if the value exist then value in column c
should be pushed to the respected result array.
I have tried using some other suggestions from community like:
Code Example 1:
for i in 1:5
mean_val = mean(df[:, :c] for i in ("a", "b")
end
Code Example 2:
for i in 1:5
df.row = axes(df, 1)
mean_val = mean((filter(x->x[:a] == i || x[:b] == i ,df))[:c])
end
However these do not work and return a desired output.
Please advice on my mistake in the code. Also, please do suggest if there is any document which explains about implementing multiple conditions in a statement, and accessing dataframe elements for any other operations in julia.
Thank you in advance
Upvotes: 2
Views: 1272
Reputation: 20298
A first way to do what (I think) you want to achieve would be to use the indexing syntax to take a subset of your dataframe:
julia> using DataFrames
julia> df = DataFrame(a = rand(1:5, 10), b = rand(1:5, 10), c = rand(1:100, 10))
10×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 25
2 │ 5 4 72
3 │ 4 3 37
4 │ 4 3 46
5 │ 3 2 31
6 │ 3 5 43
7 │ 5 1 35
8 │ 5 2 54
9 │ 1 1 64
10 │ 1 4 57
julia> idx = (df.a .== 3) .| (df.b .== 3)
10-element BitArray{1}:
0
0
1
1
1
1
0
0
0
0
julia> filtered_c = df[idx, :c]
4-element Array{Int64,1}:
37
46
31
43
You can then compute any statistics you want on the resulting filtered values:
julia> using Statistics
julia> mean(filtered_c)
39.25
Another way of doing the same thing would rely on the use of filter
to filter the rows you want to keep:
julia> filtered_df = filter(row -> (row.a==3 || row.b==3), df)
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 3 37
2 │ 4 3 46
3 │ 3 2 31
4 │ 3 5 43
# This way of writing things is equivalent to the previous one, but
# might be more readable in cases where the condition you're checking
# is more complex
julia> filtered_df = filter(df) do row
row.a == 3 || row.b == 3
end
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 3 37
2 │ 4 3 46
3 │ 3 2 31
4 │ 3 5 43
julia> mean(filtered_df.c)
39.25
Upvotes: 4
Reputation: 69949
As a small efficiency note to an excellent answer by François Févotte, it is faster to do:
julia> filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
4×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 5 1
2 │ 3 5 9
3 │ 4 3 74
4 │ 4 3 63
if you have a very large data frame. There are two differences here:
[:a, :b] => (a,b) -> a == 3 || b == 3
synax, which is type stable (so it will iterate rows faster);view=true
to produce a view of a source data frame, which allocates much less (it might matter for very large data frames);Here is a small example of different row subsetting options you have on a larger data frame:
julia> df = DataFrame(a=rand(1:3, 10^8), b=rand(1:3, 10^8), c=rand(10^8));
julia> function test()
@time filter(row -> (row.a==3 || row.b==3), df)
@time df[(df.a .== 3) .| (df.b .== 3), :]
@time @view df[(df.a .== 3) .| (df.b .== 3), :]
@time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df)
@time filter([:a, :b] => (a,b) -> a == 3 || b == 3, df, view=true)
return nothing
end
test (generic function with 1 method)
julia> test()
19.912672 seconds (333.67 M allocations: 6.652 GiB, 5.71% gc time, 0.41% compilation time)
1.152460 seconds (29 allocations: 1.667 GiB, 14.88% gc time)
0.515334 seconds (15 allocations: 435.807 MiB, 40.49% gc time)
1.066756 seconds (412.82 k allocations: 1.689 GiB, 5.56% gc time, 12.54% compilation time)
0.646710 seconds (382.98 k allocations: 455.835 MiB, 31.27% gc time, 23.02% compilation time)
julia> test()
18.194791 seconds (333.34 M allocations: 6.635 GiB, 4.87% gc time)
1.018816 seconds (29 allocations: 1.667 GiB, 15.34% gc time)
0.469027 seconds (15 allocations: 435.807 MiB, 41.19% gc time)
0.912572 seconds (30 allocations: 1.667 GiB, 5.32% gc time)
0.480374 seconds (16 allocations: 435.807 MiB, 41.15% gc time)
Upvotes: 3