Reputation: 5104
This link and others tell me that Spark's groupByKey should not be used when there is a large number of keys, since Spark shuffles all the data around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around, by reducing locally on each node, but I can't find the PySpark way to do it (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode: grouped DataFrames have no reduce method
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the DataFrame API does not offer a reduce operation. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.
Upvotes: 5
Views: 2495
Reputation: 28392
A DataFrame groupBy followed by an agg will not move the data around unnecessarily; see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
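For illustration, here is a minimal sketch of the question's reduction written with the RDD API (assuming df has the columns A and TotalValue from the question); reduceByKey merges rows locally on each partition, so only one candidate row per key per partition is shuffled:

# Sketch: argmax-per-key with the RDD API.
pair_rdd = df.rdd.map(lambda row: (row["A"], row))

# reduceByKey combines rows locally on each partition first, so only
# one candidate row per key per partition crosses the network.
max_rows = pair_rdd.reduceByKey(
    lambda x, y: x if x["TotalValue"] > y["TotalValue"] else y
).values()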
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group, which can be achieved with the max function:
from pyspark.sql import functions as F

# Maximum TotalValue per value of A
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max, there is a multitude of functions that can be used in combination with agg; see here for all operations.
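For instance, several aggregates can be computed in a single agg call (the extra aggregations here are purely illustrative, reusing the question's columns):

# Several aggregates computed in one pass over the grouped data.
summary = joined_df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.min("TotalValue").alias("MinValue"),
    F.avg("TotalValue").alias("AvgValue"),
    F.count("TotalValue").alias("Rows"),
)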
Upvotes: 4
Reputation: 91
The documentation is pretty all over the place.
There has been a lot of optimization work on DataFrames. DataFrames carry additional information about the structure of your data, which helps with this optimization. I often find people recommending DataFrames over RDDs due to this "increased optimization."
There is a lot of heavy wizardry behind the scenes.
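One way to see some of that wizardry is to ask Spark for its query plan. A minimal sketch, assuming a dataframe df with the question's columns:

from pyspark.sql import functions as F

# explain(True) prints the parsed, analyzed, optimized, and physical
# plans that the Catalyst optimizer produced for this query.
df.groupBy("A").agg(F.max("TotalValue")).explain(True)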
I recommend that you try groupBy on both RDDs and DataFrames on large datasets and compare the results. Sometimes the only way to know is to measure.
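A rough sketch of such a comparison (wall-clock timing only; spark, df, and the column names are assumed to already exist):

import time

from pyspark.sql import functions as F

def timed(label, action):
    # Spark is lazy, so each action triggers (and times) a full job.
    start = time.perf_counter()
    action()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Group with the RDD API: every (key, value) pair is shuffled.
timed("RDD groupByKey", lambda: df.rdd
      .map(lambda row: (row["A"], row["TotalValue"]))
      .groupByKey()
      .mapValues(max)
      .count())

# Group with the DataFrame API: partial aggregation runs before the shuffle.
timed("DataFrame groupBy", lambda: df
      .groupBy("A")
      .agg(F.max("TotalValue"))
      .count())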
Also, for performance improvements, I suggest fiddling (through trial and error) with Spark's configuration and partitioning settings.
Upvotes: 0