Mahsa Hassankashi

Reputation: 2139

How to use the average function in Neo4j with collections

I want to calculate the covariance of two vectors given as collections, A = [1, 2, 3, 4] and B = [5, 6, 7, 8].

Cov(A,B) = Sigma[(ai - avg(A)) * (bi - avg(B))] / (n - 1)
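
Worked through with these sample vectors, just to show the expected result under this formula:

avg(A) = (1 + 2 + 3 + 4) / 4 = 2.5, avg(B) = (5 + 6 + 7 + 8) / 4 = 6.5
Sigma[(ai - 2.5) * (bi - 6.5)] = 2.25 + 0.25 + 0.25 + 2.25 = 5
Cov(A,B) = 5 / (4 - 1) ≈ 1.667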

My problems with computing the covariance are:

1) I cannot nest aggregate functions, so I cannot write

SUM((ai-avg(a)) * (bi-avg(b)))

2) Alternatively, how can I iterate over two collections with one REDUCE, such as:

REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))

3) If it is not possible to process two collections in one REDUCE, how can I relate their values to calculate the covariance when they are computed separately, as in:

REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))

In other words, can I write a nested REDUCE?

4) Is there any way to do this with UNWIND or EXTRACT?

Thank you in advance for any help.

Upvotes: 6

Views: 2830

Answers (4)

cybersam

Reputation: 66999

[EDITED]

This should calculate the covariance (according to your formula), given your sample inputs:

WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
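
With the sample vectors this should return 1.6666666666666667 (5/3), since the formula divides by n - 1.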

This approach is OK when n is small, as is the case with the original sample data.

However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.

Upvotes: 6

Nicole White

Reputation: 7790

cybersam's answer is totally fine, but if you want to avoid the n^2 Cartesian product that results from the double UNWIND, you can do this instead:

WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;

Edit:

Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
   | aa | bb
---+----+----
 1 |  1 |  4
 2 |  1 |  5
 3 |  1 |  6
 4 |  2 |  4
 5 |  2 |  5
 6 |  2 |  6
 7 |  3 |  4
 8 |  3 |  5
 9 |  3 |  6

Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |     2.0 |     5.0

Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
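
As a quick sanity check of that claim, here is a minimal sketch; both expressions return 2.0:

WITH [1,2,3] AS a, [1,2,3,1,2,3] AS aRepeated
// averaging the original list and the repeated list gives the same mean
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS avgA,
       REDUCE(s = 0.0, x IN aRepeated | s + x) / SIZE(aRepeated) AS avgRepeated;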

Let's say you have two length-1000 vectors. Calculating the average of each with a double UNWIND:

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |   500.0 |  1500.0

714 ms

Is significantly slower than using REDUCE:

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
       REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
   | e_a   | e_b   
---+-------+--------
 1 | 500.0 | 1500.0

4 ms

To bring it all together, I'll compare the two queries in full on length-1000 vectors:

> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
   | covariance
---+------------
 1 |    83583.5

9105 ms

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
   | cov    
---+---------
 1 | 83583.5

33 ms

Upvotes: 7

Mahsa Hassankashi

Reputation: 2139

Thank you so much, everyone. However, I wonder which one is the most efficient:

1) Nested UNWIND with RANGE inside REDUCE -> @cybersam

2) Nested REDUCE -> @Nicole White

3) Nested WITH (resetting the query with WITH) -> @jjaderberg

But the important issue is:

Why is there a difference between your computations and the actual result?

I mean, your covariance equals 1.6666666666666667

But in the real world the covariance equals 1.25

Please check: https://www.easycalculation.com/statistics/covariance.php

Vector X: [1, 2, 3, 4] Vector Y: [5, 6, 7, 8]

I think this difference is because some computations do not use (n-1) as the divisor and use n instead. When the divisor grows from n-1 to n, the result shrinks from about 1.67 to 1.25.
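
If the population covariance (dividing by n) is what is wanted, a minimal variant of the REDUCE-based query above is enough; only the final divisor changes:

WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
// dividing by n instead of (n - 1) gives the population covariance, 1.25 for these vectors
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / n AS cov;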

Upvotes: 0

jjaderberg

Reputation: 9952

How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:

WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg

and for both collections

WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg, [5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg

Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do

RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
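
Putting those pieces together into a single query (a sketch combining the fragments above; it should return roughly 1.667 for the sample vectors):

WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg, [5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance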

Upvotes: 4
