Gravity Boy

Reputation: 7

How to map one column to multiple binary columns in Spark?

This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:

+-----------------+
| C1     | C2     |
+--------+--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+-----------------+

to

+--------------------------------------------+
| C1     | www    | xxx    | yyy    | zzz    |
+--------+--------+--------+--------+--------+
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
+--------------------------------------------+

How does one achieve this in PySpark? Presence should be 1 and absence 0.

Upvotes: 0

Views: 127

Answers (1)

ZygD

Reputation: 24413

Yes, you will need pivot. For the aggregation, in your case it's simplest to use F.first(F.lit(1)), which marks each (C1, C2) pair that occurs at least once; the resulting nulls (absent pairs) can then be replaced with 0 using df.fillna(0). As a side note, if you know the distinct C2 values in advance, you can pass them as the second argument to pivot to skip the extra job Spark runs to compute them.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)

df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# |  B|  1|  1|  1|  0|
# |  A|  0|  1|  1|  1|
# +---+---+---+---+---+
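To make the logic transparent, here is a plain-Python sketch (no Spark required) of what the pivot + first(lit(1)) + fillna(0) combination computes: one row per C1 value, one column per distinct C2 value, with 1 where the pair occurs and 0 where it doesn't.

```python
# Sample data, same as the question.
rows = [('A', 'xxx'), ('B', 'yyy'), ('A', 'yyy'), ('B', 'www'),
        ('B', 'xxx'), ('A', 'zzz'), ('A', 'xxx'), ('A', 'yyy')]

# Pivot columns: the distinct C2 values, sorted as pivot does by default.
cols = sorted({c2 for _, c2 in rows})

# Record which C2 values each C1 key was seen with at least once
# (the role played by F.first(F.lit(1)) after groupBy/pivot).
present = {}
for c1, c2 in rows:
    present.setdefault(c1, set()).add(c2)

# 1 for presence, 0 for absence (the role played by fillna(0)).
table = {c1: [1 if c in seen else 0 for c in cols]
         for c1, seen in present.items()}

print(cols)        # ['www', 'xxx', 'yyy', 'zzz']
print(table['A'])  # [0, 1, 1, 1]
print(table['B'])  # [1, 1, 1, 0]
```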

Upvotes: 1
