Reputation: 793
I would like to do some aggregates such as collect_list on non-numeric fields and then transpose to multiple columns based on a different column.
Here is a simple data set and my desired result.
df = sqlContext.createDataFrame([
("John", 'Male', 'GA'),
("Mary", "Female", 'GA'),
("Alex", "Male", "NY"),
("Ana", "Female", "NY"),
("Amy", "Female", "NY")
], ["Name", "gender", "state"])
I'm able to group by 2 fields as follows.
(df
.groupBy('state', 'gender')
.agg(collect_list('Name'))
).show()
+-----+------+------------------+
|state|gender|collect_list(Name)|
+-----+------+------------------+
| NY| Male| [Alex]|
| GA| Male| [John]|
| GA|Female| [Mary]|
| NY|Female| [Ana, Amy]|
+-----+------+------------------+
How can I transpose it to two fields based on the gender field? Here is what I'm looking for.
state | male | female
--------------------------
GA | ['John'] | ['Mary']
NY | ['Alex'] | ['Ana', 'Amy']
Upvotes: 1
Views: 184
Reputation: 7399
You can transpose using pivot()
after a groupby
. Spark also has a collect_list()
aggregation.
df.groupBy('state').pivot('gender').agg(F.collect_list('Name')).show()
# +-----+----------+------+
# |state| Female| Male|
# +-----+----------+------+
# | NY|[Ana, Amy]|[Alex]|
# | GA| [Mary]|[John]|
# +-----+----------+------+
Upvotes: 1