Reputation: 41
I am trying to find the values that are common among the groups created by applying groupBy and pivot to a dataframe in PySpark.
For example, the data looks like:
+--------+---------+---------+
|PlayerID|PitcherID|ThrowHand|
+--------+---------+---------+
|10000598| 10000104|        R|
|10000908| 10000104|        R|
|10000489| 10000104|        R|
|10000734| 10000104|        R|
|10006568| 10000104|        R|
|10000125| 10000895|        L|
|10000133| 10000895|        L|
|10006354| 10000895|        L|
|10000127| 10000895|        L|
|10000121| 10000895|        L|
+--------+---------+---------+
After applying:
df.groupBy('PlayerID').pivot('ThrowHand').agg(F.count('ThrowHand')).drop('null').show(10)
I get something like:
+--------+----+---+
|PlayerID|   L|  R|
+--------+----+---+
|10000591|  11| 43|
|10000172|  22|101|
|10000989|  05| 19|
|10000454|  05| 17|
|10000723|  11| 33|
|10001989|  11| 38|
|10005243|  20| 60|
|10003366|  11| 26|
|10006058|  02| 09|
+--------+----+---+
Is there some way I can get the values of 'PitcherID' that are common to the L and R groups counted above?
What I mean is: for PlayerID = 10000591, I have 11 PitcherIDs where ThrowHand is L and 43 PitcherIDs where ThrowHand is R. It is possible that some pitchers appear in both the group of 11 and the group of 43.
Is there any way I can get these common PitcherIDs?
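To make the goal concrete, here is a minimal plain-Python sketch of the result being asked for. The rows and the function name common_pitchers_by_player are hypothetical and only illustrate the idea; they are not taken from the real data, and this is not a Spark solution.

```python
from collections import defaultdict

# Hypothetical rows: (PlayerID, PitcherID, ThrowHand). Made up for illustration.
rows = [
    ("P1", "A", "L"),
    ("P1", "B", "L"),
    ("P1", "A", "R"),
    ("P1", "C", "R"),
    ("P2", "D", "L"),
    ("P2", "E", "R"),
]

def common_pitchers_by_player(rows):
    """Return {PlayerID: sorted PitcherIDs seen with both 'L' and 'R'}."""
    pitchers = defaultdict(lambda: {"L": set(), "R": set()})
    for player_id, pitcher_id, throw_hand in rows:
        if throw_hand in ("L", "R"):  # skip null or unknown hands
            pitchers[player_id][throw_hand].add(pitcher_id)
    return {
        player_id: sorted(hands["L"] & hands["R"])
        for player_id, hands in pitchers.items()
    }

result = common_pitchers_by_player(rows)
print(result)
```

Here pitcher A is the only one that appears under both L and R for player P1, so that is the kind of value I want back for each PlayerID.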
Upvotes: 1
Views: 1041
Reputation: 41957
You should first collect the set of PitcherIDs for each ThrowHand, alongside the count:
import pyspark.sql.functions as F
#collect set of pitchers in addition to count of ThrowHand
df = df.groupBy('PlayerID').pivot('ThrowHand').agg(F.count('ThrowHand').alias('count'), F.collect_set('PitcherID').alias('PitcherID')).drop('null')
which should give you a dataframe with this schema:
root
 |-- PlayerID: string (nullable = true)
 |-- L_count: long (nullable = false)
 |-- L_PitcherID: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- R_count: long (nullable = false)
 |-- R_PitcherID: array (nullable = true)
 |    |-- element: string (containsNull = true)
Then write a udf function to get the common PitcherIDs:
import pyspark.sql.types as T
#columns with pitcherid and count
pitcherColumns = [x for x in df.columns if 'PitcherID' in x]
countColumns = [x for x in df.columns if 'count' in x]
#udf function to find the common pitchers among the collected pitcher arrays
@F.udf(T.ArrayType(T.StringType()))
def commonFindingUdf(*pitcherCols):
    #treat a missing (null) array as empty
    common = set(pitcherCols[0] or [])
    for pitcher in pitcherCols[1:]:
        common = common.intersection(pitcher or [])
    return list(common)
#calling the udf function and selecting the required columns
df.select(F.col('PlayerID'), commonFindingUdf(*[F.col(x) for x in pitcherColumns]).alias('common_PitcherID'), *countColumns)
which should give you a final dataframe with this schema:
root
 |-- PlayerID: string (nullable = true)
 |-- common_PitcherID: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- L_count: long (nullable = false)
 |-- R_count: long (nullable = false)
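The intersection step inside the udf can be checked without a Spark session. Below is a minimal sketch in plain Python of the same logic. The name common_pitchers and the IDs 10000777 and 10000123 are illustrative, and treating a missing (None) array as empty is an assumption about how you want players with no pitchers for one hand to be handled.

```python
def common_pitchers(*pitcher_cols):
    """Intersect any number of pitcher-ID lists; None counts as empty (assumption)."""
    if not pitcher_cols:
        return []
    common = set(pitcher_cols[0] or [])
    for pitchers in pitcher_cols[1:]:
        common &= set(pitchers or [])
    # Sorted only to make the output deterministic for checking.
    return sorted(common)

# Illustrative inputs standing in for the L_PitcherID and R_PitcherID arrays.
l_pitchers = ["10000104", "10000895", "10000777"]
r_pitchers = ["10000895", "10000104", "10000123"]

print(common_pitchers(l_pitchers, r_pitchers))  # ['10000104', '10000895']
print(common_pitchers(l_pitchers, None))        # []
```

If this behaviour suits your data, the same function could be registered with F.udf(common_pitchers, T.ArrayType(T.StringType())) and applied to the pitcher columns as shown above.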
I hope the answer is helpful
Upvotes: 1