Reputation: 8145
I have a table stored as an RDD of lists, on which I want to perform something akin to a groupby in SQL or pandas, taking the sum or average for every variable.
The way I currently do it is this (untested code):
from operator import add

aggregations = [(3, add), (4, add)]  # (column index, aggregation function)
results = {}
for i, (col, func) in enumerate(aggregations):
    # bind col as a default argument so each lambda keeps its own column index
    pairs = rdd.map(lambda x, col=col: (x[6], float(x[col])))
    results[i] = pairs.reduceByKey(func)
Then I'll need to join all the RDDs in results.
This isn't very efficient though. Is there a better way?
Upvotes: 1
Views: 1070
Reputation: 186
If you are using Spark 1.3 or later, you could look at the DataFrame API.
In the pyspark shell:
import numpy as np

# create a DataFrame (this can also be created from an RDD)
df = sqlCtx.createDataFrame([[float(c) for c in row] for row in np.random.rand(50, 3)])
df.agg({col: "mean" for col in df.columns}).collect()
This outputs:
[Row(AVG(_3#1456)=0.5547187588389414, AVG(_1#1454)=0.5149476209374797, AVG(_2#1455)=0.5022967093047612)]
The available aggregate methods are "avg"/"mean", "max", "min", "sum", "count".
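The dictionary form also lets you pick a different method per column. A quick sketch against the df created above, whose columns are _1, _2 and _3:
df.agg({"_1": "max", "_2": "sum", "_3": "count"}).collect()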
To get several aggregations for the same column, you can call agg
with a list of explicitly constructed aggregations rather than with a dictionary:
from pyspark.sql import functions as F
df.agg(*[F.min(col) for col in df.columns] + [F.avg(col) for col in df.columns]).collect()
Or for your case:
df.agg(F.count(df.var3), F.max(df.var3))  # etc...
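And since your RDD code keys the aggregation on a column (x[6]), the same aggregate expressions work after a groupBy. A sketch, assuming that column is named var6 in the DataFrame:
df.groupBy(df.var6).agg(F.sum(df.var3), F.avg(df.var3)).collect()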
Upvotes: 3