Hugo Kitano

Reputation: 91

PySpark DataFrame groupby into list of values?

Simply, let's say I had the following DataFrame:

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

How could I group by department and get all other values into a list, as follows:

+----------+-------------------------------------+------------------------------+
|department|employee_name                        |salary                        |
+----------+-------------------------------------+------------------------------+
|Sales     |[James, Michael, Robert, James, Saif]|[3000, 4600, 4100, 3000, 4100]|
|Finance   |[Maria, Scott, Jen]                  |[3000, 3300, 3900]            |
|Marketing |[Jeff, Kumar]                        |[3000, 2000]                  |
+----------+-------------------------------------+------------------------------+

Upvotes: 2

Views: 3404

Answers (2)

wwnde

Reputation: 26676

Let's try with minimal typing:

from pyspark.sql.functions import collect_list

df.groupby('department').agg(*[collect_list(c).alias(c) for c in df.drop('department').columns]).show()

Upvotes: 2

notNull

Reputation: 31460

Use collect_list with the groupBy clause:

from pyspark.sql.functions import *

df.groupBy(col("department")).agg(collect_list(col("employee_name")).alias("employee_name"), collect_list(col("salary")).alias("salary"))
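The group-into-lists semantics of `groupBy` + `collect_list` can be checked without a Spark session. Below is a minimal pure-Python sketch of the same logic (the `group_into_lists` helper and the inlined sample rows are illustrative, not part of either answer); note that in real Spark, the element order produced by `collect_list` is not guaranteed after a shuffle, whereas this sketch preserves input order:

```python
from collections import defaultdict

# Rows mirroring the example DataFrame: (employee_name, department, salary).
rows = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]

def group_into_lists(rows):
    """Group rows by department, collecting names and salaries into lists
    (mimics groupBy + collect_list; order here follows input order)."""
    grouped = defaultdict(lambda: ([], []))
    for name, dept, salary in rows:
        names, salaries = grouped[dept]
        names.append(name)
        salaries.append(salary)
    return dict(grouped)

result = group_into_lists(rows)
print(result["Sales"])
# → (['James', 'Michael', 'Robert', 'James', 'Saif'],
#    [3000, 4600, 4100, 3000, 4100])
```

This matches the expected output table in the question, one (names, salaries) pair of lists per department.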

Upvotes: 4

Related Questions