user1311888

Reputation: 793

How to do an aggregate and transpose in PySpark?

I would like to do some aggregates such as collect_list on non-numeric fields and then transpose to multiple columns based on a different column.

Here is a simple data set and my desired result.

df = sqlContext.createDataFrame([
    ("John", 'Male', 'GA'),
    ("Mary", "Female", 'GA'),
    ("Alex", "Male", "NY"),
    ("Ana", "Female", "NY"),
    ("Amy", "Female", "NY")   
], ["Name", "gender", "state"])

I'm able to group by 2 fields as follows.

from pyspark.sql.functions import collect_list

(df
 .groupBy('state', 'gender')
 .agg(collect_list('Name'))
).show()
+-----+------+------------------+
|state|gender|collect_list(Name)|
+-----+------+------------------+
|   NY|  Male|            [Alex]|
|   GA|  Male|            [John]|
|   GA|Female|            [Mary]|
|   NY|Female|        [Ana, Amy]|
+-----+------+------------------+

How can I transpose it to two fields based on the gender field? Here is what I'm looking for.

state |   male   | female
--------------------------
GA    | ['John'] | ['Mary']
NY    | ['Alex'] | ['Ana', 'Amy']

Upvotes: 1

Views: 184

Answers (1)

pissall

Reputation: 7399

You can transpose using pivot() after the groupBy(), keeping collect_list() as the aggregation.

import pyspark.sql.functions as F

df.groupBy('state').pivot('gender').agg(F.collect_list('Name')).show()

# +-----+----------+------+
# |state|    Female|  Male|
# +-----+----------+------+
# |   NY|[Ana, Amy]|[Alex]|
# |   GA|    [Mary]|[John]|
# +-----+----------+------+

Upvotes: 1
