Reputation: 59
I have a dataframe to which I am applying a pivot transformation, and I want to know if there is a way to get the same end result while avoiding the pivot. The dataframe looks like this:
+------+---+----+-----+--------+
|gender|pro|week|share|forecast|
+------+---+----+-----+--------+
|  Male|  A|  40|  0.2|   195.0|
|Female|  A|  40| 0.01|    38.0|
|  Male|  B|  40| 0.15|   733.0|
|Female|  B|  41| 0.15|   579.0|
|Female|  C|  41| 0.01|    38.0|
+------+---+----+-----+--------+
The expected output is the following:
+------+---+----+-------+--------+--------+--------+
|gender|pro|week|share_1|share_10|share_15|share_20|
+------+---+----+-------+--------+--------+--------+
|  Male|  A|  40|    0.0|     0.0|     0.0|   195.0|
|Female|  A|  40|   38.0|     0.0|     0.0|     0.0|
|Female|  B|  41|    0.0|     0.0|   579.0|     0.0|
|Female|  C|  41|   38.0|     0.0|     0.0|     0.0|
|  Male|  B|  40|  191.0|   205.0|   733.0|   245.0|
+------+---+----+-------+--------+--------+--------+
At the moment I am implementing this:
from pyspark.sql.functions import first

(df.groupBy('gender', 'pro', 'week')
   .pivot('share')
   .agg(first('forecast'))
   .withColumnRenamed('0.01', 'share_1')
   .withColumnRenamed('0.1', 'share_10')
   .withColumnRenamed('0.15', 'share_15')
   .withColumnRenamed('0.2', 'share_20'))
Is there a way to get the same result without applying a pivot transformation?
Upvotes: 1
Views: 395
Reputation: 15318
Performance is poor because you do not provide values for the share column.
cf. the documentation for pivot(pivot_col, values=None):
Not providing values is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
I can assure you that the current official implementation of pivot will always be better than anything you'll try to build yourself. Just add your values and it will be fine.
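For reference, a minimal sketch of your same pivot with the values supplied up front (assuming the column names from your question):

from pyspark.sql.functions import first

# Listing the share values here lets Spark skip the extra job that
# computes the distinct values of `share` before pivoting.
shares = [0.01, 0.1, 0.15, 0.2]

result = (df.groupBy('gender', 'pro', 'week')
            .pivot('share', shares)
            .agg(first('forecast'))
            .withColumnRenamed('0.01', 'share_1')
            .withColumnRenamed('0.1', 'share_10')
            .withColumnRenamed('0.15', 'share_15')
            .withColumnRenamed('0.2', 'share_20'))

The output is identical to what you have now; only the extra distinct-values computation is avoided.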
Upvotes: 2