Alexis Eggermont

Reputation: 8145

Spark reduceByKey on several different values

I have a table stored as an RDD of lists, on which I want to perform something akin to a GROUP BY in SQL or a groupby in pandas, taking the sum or average of every variable.

The way I currently do it is this (untested code):

from operator import add

l = [(3, "add"), (4, "add")]   # (column index, aggregation name)
ops = {"add": add}             # map aggregation names to functions
aggregated = {}
for i, (col, op_name) in enumerate(l):
    # key on column 6, value is the column to aggregate
    pairs = RDD.map(lambda x, c=col: (x[6], float(x[c])))
    aggregated[i] = pairs.reduceByKey(ops[op_name])

Then I'll need to join all the RDDs in aggregated.
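Roughly, that join step would be something like this (again untested, and assuming all the aggregated RDDs share the same key):

joined = aggregated[0].join(aggregated[1])   # -> (key, (agg_col3, agg_col4))
for i in range(2, len(aggregated)):
    joined = joined.join(aggregated[i])      # values nest one level deeper per join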

This isn't very efficient though. Is there a better way?

Upvotes: 1

Views: 1070

Answers (1)

nrg

Reputation: 186

If you are using Spark >= 1.3, you could look at the DataFrame API.

In the pyspark shell:

import numpy as np
# create a DataFrame (this can also be built from an RDD)
df = sqlCtx.createDataFrame([[float(v) for v in row] for row in np.random.rand(50, 3)])
df.agg({col: "mean" for col in df.columns}).collect()

This outputs:

[Row(AVG(_3#1456)=0.5547187588389414, AVG(_1#1454)=0.5149476209374797, AVG(_2#1455)=0.5022967093047612)]

The available aggregate methods are "avg"/"mean", "max", "min", "sum", "count".
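The dict form also lets you mix different methods across columns; with the auto-generated column names from the example above (_1, _2, _3), something like this should work:

df.agg({"_1": "sum", "_2": "max", "_3": "min"}).collect()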

To get several aggregations for the same column, you can call agg with a list of explicitly constructed aggregations rather than with a dictionary:

from pyspark.sql import functions as F
df.agg(*([F.min(col) for col in df.columns] + [F.avg(col) for col in df.columns])).collect()

Or for your case:

df.agg(F.count(df.var3), F.max(df.var3))  # etc...
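Since you ultimately want per-group results rather than global ones, you would combine this with groupBy. A minimal sketch, assuming your grouping column is named key and the value columns var3 and var4 (substitute your actual column names):

# "key", "var3" and "var4" are placeholder column names
df.groupBy(df.key).agg(F.sum(df.var3), F.avg(df.var4)).collect()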

Upvotes: 3
