xiaodai

Reputation: 16004

Julia: How to compute mean by group using aggregate for IndexedTables.jl?

I am trying to use the aggregate function to compute the mean of a variable by group:

using Distributions, PooledArrays

N=Int64(2e9/8); K=100;

pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:(N/K)]

function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end

using JuliaDB
DT = IndexedTable(Columns([1:N;]), Columns(
  id1 = randstrarray(pool, N),
  v3 =  rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
 ));

res = IndexedTables.aggregate(mean, DT, by=(:id1,), with=:v3)

However, I get the error:

MethodError: no method matching mean(::Float64, ::Float64)
Closest candidates are:
  mean(!Matched::Union{Function, Type}, ::Any) at statistics.jl:19
  mean(!Matched::AbstractArray{T,N} where N, ::Any) where T at statistics.jl:57
  mean(::Any) at statistics.jl:34
in  at base\<missing>
in #aggregate#144 at IndexedTables\src\query.jl:119
in aggregate_to at IndexedTables\src\query.jl:148

However,

IndexedTables.aggregate(+, DT, by=(:id1,), with=:v3)

works fine.

Upvotes: 0

Views: 1315

Answers (3)

Liso

Reputation: 2260

Edit:

res = IndexedTables.aggregate_vec(mean, DT, by=(:id1,), with=:v3)

From the help:

help?> IndexedTables.aggregate_vec
  aggregate_vec(f::Function, x::IndexedTable)

  Combine adjacent rows with equal indices using a function from vector to
  scalar, e.g. mean.
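To make "a function from vector to scalar" concrete, here is a minimal sketch in plain Julia that does not use IndexedTables at all. The helper `group_apply` and the toy data are hypothetical and only illustrate the idea: collect the values of each group into a vector, then call `mean` once per group. It assumes Julia 1.x, where `mean` lives in the Statistics standard library.

```julia
using Statistics  # assumption: Julia 1.x; on Julia 0.6 `mean` is already in Base

# Hypothetical helper, not part of IndexedTables: gather the values of each
# group into a vector, then apply a vector-to-scalar function `f` to it.
function group_apply(f, ks::AbstractVector, vals::AbstractVector)
    groups = Dict{eltype(ks), Vector{eltype(vals)}}()
    for (k, v) in zip(ks, vals)
        push!(get!(groups, k, eltype(vals)[]), v)
    end
    return Dict(k => f(vs) for (k, vs) in groups)
end

# Toy data standing in for the id1 and v3 columns
ids = ["id001", "id002", "id001", "id002", "id001"]
v3  = [1.0, 10.0, 2.0, 20.0, 6.0]

res = group_apply(mean, ids, v3)
```

Here `mean` sees the whole vector of a group at once, which is why it fits this style of aggregation but not a 2-argument reduction.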


Old answer:

(I am keeping it because it was a pleasant exercise for me in creating a helper type and functions when something doesn't work the way we want. Maybe it will help someone in the future :)


I am not sure how you would like to aggregate the mean. My idea is to calculate the "center of gravity" of points with equal mass.

The center of two points is G = (A+B)/2.

Adding (aggregating) a third point C gives (2G+C)/3 (2G because G's mass is A's mass + B's mass),

etc.

struct Atractor
     center::Float64
     mass::Int64
end

" two points create new atractor with double mass "
mediocre(a::Float64, b::Float64) = Atractor((a+b)/2, 2)

# please forgive the function's name! :)

" aggregate new point to atractor "
function mediocre(a::Atractor, b::Float64)
    mass = a.mass + 1  
    Atractor((a.center*a.mass+b)/mass, mass)
end

Test:

tst_array = rand(Float64, 100);

isapprox(mean(tst_array), reduce(mediocre, tst_array).center)
true  # at least in my tests! :) 

mean(tst_array) == reduce(mediocre, tst_array).center  # sometimes true

For the aggregate function we need a little more work:

import Base.convert

" we need a convert method from Atractor to Float64 because the aggregate
  function wants to store the result as Float64 "
convert(::Type{Float64}, x::Atractor) = x.center

And now it (probably :P) works:

res = IndexedTables.aggregate(mediocre, DT, by=(:id1,), with=:v3)
id1     │ 
────────┼────────
"id001" │ 45.9404
"id002" │ 47.0032
"id003" │ 46.0846
"id004" │ 47.2567
...

I hope you can see that aggregating the mean this way has an impact on precision! (There are more sum and divide operations.)

Upvotes: 1

Chris Rackauckas

Reputation: 19132

You need to tell it how to reduce two numbers to one. mean is for arrays. So just use an anonymous function:

res = IndexedTables.aggregate((x,y)->(x+y)/2, DT, by=(:id1,), with=:v3)
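One thing worth checking before relying on this: when a group has more than two rows, folding with `(x,y)->(x+y)/2` weights later values more heavily, so the result can differ from the arithmetic mean of the group. The sketch below uses plain `foldl` on made-up values rather than IndexedTables, and assumes Julia 1.x for `using Statistics`.

```julia
using Statistics  # assumption: Julia 1.x; on Julia 0.6 `mean` is already in Base

# The same 2-argument reducer as in the answer
pairwise_avg = (x, y) -> (x + y) / 2

# Made-up values for one group with three rows
vals = [1.0, 2.0, 6.0]

folded    = foldl(pairwise_avg, vals)  # ((1 + 2)/2 + 6)/2
true_mean = mean(vals)                 # (1 + 2 + 6)/3

println("folded = ", folded, ", mean = ", true_mean)
```

With these three values the fold gives 3.75 while the mean is 3.0; for groups of exactly two rows the two agree.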

Upvotes: 1

tamasgal

Reputation: 26259

I'd really like to help you, but it took me 10 minutes to install all the packages and another few minutes to run the code and figure out what it actually does (or doesn't). It would be great if you provided a "minimal working example" which focusses on the problem. In fact, the only requirement to reproduce your problem is seemingly IndexedTables and two random arrays.

(Sorry, this is not a complete answer, but too long to be a comment.)

Anyway, if you read the docstring of IndexedTables.aggregate, you see that it requires a function which takes two arguments and returns a single value:

help?> IndexedTables.aggregate
  aggregate(f::Function, arr::IndexedTable)

  Combine adjacent rows with equal indices using the given 2-argument
  reduction function, returning the result in a new array.

You can see in the error message you posted that there is

no method matching mean(::Float64, ::Float64)

Since I don't know what you expect to be calculated, I now assume that you want to calculate the mean value of the two numbers. In this case you can define another method for mean():

Base.mean(x, y) = (x+y) / 2

This will fulfil the aggregate function signature requirements. But I am not sure if this is what you want.
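If what you want is the mean over the whole group, one way to express it as a 2-argument reduction is to carry a running sum and count and divide at the end. The following is only a sketch in plain Julia: the type `SumCount` and the function `combine` are hypothetical names, and I have not wired it into IndexedTables.aggregate.

```julia
# Hypothetical accumulator: running sum and number of values seen so far
struct SumCount
    sum::Float64
    n::Int
end

# Wrap a single value as an accumulator of count 1
SumCount(x::Real) = SumCount(float(x), 1)

# 2-argument reduction: merging two accumulators is associative,
# so the order in which rows are combined does not change the totals
combine(a::SumCount, b::SumCount) = SumCount(a.sum + b.sum, a.n + b.n)

# Made-up values for one group
vals = [1.0, 2.0, 6.0]

acc = mapreduce(SumCount, combine, vals)
group_mean = acc.sum / acc.n
```

The division happens once, after the reduction, so every value in the group contributes with equal weight.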

Upvotes: 0
