Reputation: 91
Simply put, let's say I have the following DataFrame:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
How could I group by department and get all other values into a list, as follows:
| department | employee_name | salary |
|---|---|---|
| Sales | [James, Michael, Robert, James, Saif] | [3000, 4600, 4100, 3000, 4100] |
| Finance | [Maria, Scott, Jen] | [3000, 3300, 3900] |
| Marketing | [Jeff, Kumar] | [3000, 2000] |
Upvotes: 2
Views: 3404
Reputation: 26676
Let's try it with minimal typing:
from pyspark.sql.functions import collect_list
df.groupby('department').agg(*[collect_list(c).alias(c) for c in df.drop('department').columns]).show()
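For intuition, the list comprehension above generates one `collect_list` aggregation per remaining column. A plain-Python sketch of that same dynamic grouping, using hypothetical rows mirroring the question's data:

```python
# Hypothetical rows mirroring a few entries from the question's DataFrame.
rows = [
    {"employee_name": "James", "department": "Sales", "salary": 3000},
    {"employee_name": "Michael", "department": "Sales", "salary": 4600},
    {"employee_name": "Maria", "department": "Finance", "salary": 3000},
]

key = "department"
# Analogous to df.drop('department').columns in the answer above.
other_cols = [c for c in rows[0] if c != key]

# Build one list per remaining column, per group -- the same shape
# the generated collect_list aggregations produce.
out = {}
for row in rows:
    bucket = out.setdefault(row[key], {c: [] for c in other_cols})
    for c in other_cols:
        bucket[c].append(row[c])

print(out["Sales"])
# {'employee_name': ['James', 'Michael'], 'salary': [3000, 4600]}
```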
Upvotes: 2
Reputation: 31460
Use collect_list with a groupBy clause:
from pyspark.sql.functions import col, collect_list

df.groupBy(col("department")).agg(
    collect_list(col("employee_name")).alias("employee_name"),
    collect_list(col("salary")).alias("salary")
)
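As a sanity check on the expected result, here is a plain-Python sketch of what `collect_list` does per group: it keeps duplicates and, in this sketch, input order (note that Spark itself does not guarantee element order inside `collect_list` after a shuffle).

```python
from collections import defaultdict

# Rows from the question's DataFrame.
rows = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]

# collect_list keeps duplicates, so "James"/3000 appears twice under Sales.
names, salaries = defaultdict(list), defaultdict(list)
for name, dept, salary in rows:
    names[dept].append(name)
    salaries[dept].append(salary)

print(names["Sales"])     # ['James', 'Michael', 'Robert', 'James', 'Saif']
print(salaries["Sales"])  # [3000, 4600, 4100, 3000, 4100]
```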
Upvotes: 4