Dr. Andrew

Reputation: 2621

Python Spark combineByKey Average

I'm trying to learn Spark in Python, and am stuck on combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but with what comes afterward. The typical example (from the O'Reilly 2015 Learning Spark book) can be seen on the web in many places; here's one.

The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using Spark 2.0.1 and IPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book), sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap(), causes Spark to go bats**t crazy with Java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.

Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:

In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors

It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.

Upvotes: 1

Views: 1073

Answers (2)

Elior Malul

Reputation: 691

Averaging over a specific column can be done by using a window function. Consider the following code:

import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
                           ['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()

This would yield:

+---+---+---+
|  a|  i|avg|
+---+---+---+
|  b|  3|4.0|
|  b|  5|4.0|
|  a|  2|4.0|
|  a|  6|4.0|
+---+---+---+

The average aggregation is done on each worker separately, requires no round trip to the driver, and is therefore efficient.
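If the goal is a single average per key collected back into a plain Python dict (like the collectAsMap result in the question), one way is a groupBy aggregation instead of a window; a minimal sketch, assuming the same df as above (the 'avg' alias is only illustrative):

import pyspark.sql.functions as F

# One row per key instead of one row per input record;
# dict() works because Row objects behave like tuples.
avg_per_key = dict(
    df.groupBy('a')
      .agg(F.avg('i').alias('avg'))
      .collect())
# For the data above: {'a': 4.0, 'b': 4.0}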

Upvotes: 0

zero323

Reputation: 330063

Step by step:

  • lambda (key, (totalSum, count)): ... is so-called tuple parameter unpacking, which has been removed in Python 3 (PEP 3113).
  • RDD.map takes a function which expects a single argument. The function you try to use:

    lambda key, vals: ...
    

    is a function which expects two arguments, not one. A valid translation of the 2.x syntax would be:

    lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
    

    or:

    def get_mean(key_vals):
        key, (total, cnt) = key_vals
        return key, total / cnt
    
    cbk.map(get_mean)
    

    You can also make this much simpler with mapValues (see the usage sketch after this list):

    cbk.mapValues(lambda x: x[0] / x[1])
    
  • Finally, a numerically stable solution would be:

    from pyspark.statcounter import StatCounter
    
    (pRDD
        .combineByKey(
            lambda x: StatCounter([x]),
            StatCounter.merge,
            StatCounter.mergeStats)
        .mapValues(StatCounter.mean))
    
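Applied to the cbk RDD built in the question, a minimal usage sketch (the per-key (sum, count) pairs are taken from the cbk.collect() output shown above, so the averages follow directly from them):

# cbk holds (key, (total, count)) pairs, e.g. ('s', (18, 3))
cbk.mapValues(lambda x: x[0] / x[1]).collectAsMap()
# {'s': 6.0, 'g': 4.0, 'c': 8.0}

The StatCounter pipeline, followed by collectAsMap(), yields the same mapping, while also keeping count, variance, min and max available through the StatCounter objects if you need them before the final mapValues.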

Upvotes: 2
