Reputation: 7
This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+--------+--------+
| C1     | C2     |
+--------+--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+--------+--------+
to
+--------+-----+-----+-----+-----+
| C1     | www | xxx | yyy | zzz |
+--------+-----+-----+-----+-----+
| A      | 0   | 1   | 1   | 1   |
| B      | 1   | 1   | 1   | 0   |
+--------+-----+-----+-----+-----+
That is, each distinct value of C2 becomes a column, with 1 marking presence and 0 marking absence. How does one achieve this in PySpark?
Upvotes: 0
Views: 127
Reputation: 24413
Yes, you will need pivot. For the aggregation, in your case it's simplest to use F.first(F.lit(1)); where that leaves nulls (combinations that never occur), replace them with 0 using df.fillna(0).
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

# Pivot C2 into columns; emit 1 for every (C1, C2) pair that occurs.
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))

# Pairs that never occurred come back as null; map them to 0.
df = df.fillna(0)

df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# | B| 1| 1| 1| 0|
# | A| 0| 1| 1| 1|
# +---+---+---+---+---+
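If the distinct values of C2 are known in advance, they can be passed to pivot directly, which spares Spark the extra job it otherwise runs to collect the distinct values. A minimal sketch with the same sample data (the hard-coded values list is an assumption you would adapt to your data):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'), ('B', 'yyy'), ('A', 'yyy'), ('B', 'www'),
     ('B', 'xxx'), ('A', 'zzz'), ('A', 'xxx'), ('A', 'yyy')],
    ['C1', 'C2'])

# Supplying the pivot values up front skips the distinct-value scan.
values = ['www', 'xxx', 'yyy', 'zzz']  # assumed known ahead of time
result = (df.groupBy('C1')
            .pivot('C2', values)
            .agg(F.first(F.lit(1)))
            .fillna(0))
result.show()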
Upvotes: 1