Spark Create new column containing min and max of corresponding values in certain other column

Question

Let's say I have a column

    import spark.implicits._
    
    val simpleData = Seq(("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100))
        
    val df_1 = simpleData.toDF("employee_name", "department", "salary")

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

I an ideal case I would like to add columns min_department_salary max_department_salary min_salary_employee_name and max_salary_employee_name to that original dataframe. These would tell for every row for what is min and max salary, who is getting it.

So first row would be James, Sales, 3000, 3000, 4600, James, Michael

What I have now is

    val df_1_5 = df_1.groupBy('department)
                    .agg(min('salary).as("min_department_salary"), max('salary).as("max_department_salary"))

+----------+---------------------+---------------------+
|department|min_department_salary|max_department_salary|
+----------+---------------------+---------------------+
|     Sales|                 3000|                 4600|
|   Finance|                 3000|                 3900|
| Marketing|                 2000|                 3000|
+----------+---------------------+---------------------+

This is not quite there yet and I have tried join in there with the original df. I would like to avoid joins as I have a quite big dataframe.

koiralo · Accepted Answer

You can use struct to preserve the other column as below

df1.withColumn("sal-name", struct($"salary", $"employee_name"))
  .groupBy('department)
  .agg(min("sal-name").as("min"), max("sal-name").as("max"))
  .select($"department", $"min.*", $"max.*")
  .toDF("department", "min_sal", "min_sal_name", "max_sal", "min_sal_name")
  .show(false)

Output:

+----------+-------+------------+-------+------------+
|department|min_sal|min_sal_name|max_sal|min_sal_name|
+----------+-------+------------+-------+------------+
|Sales     |3000   |James       |4600   |Michael     |
|Finance   |3000   |Maria       |3900   |Jen         |
|Marketing |2000   |Kumar       |3000   |Jeff        |
+----------+-------+------------+-------+------------+

If you want all the rows then you can use window function instead of groupBy

val window = Window.partitionBy("department")

df1.withColumn("sal-name", struct($"salary", $"employee_name"))
  .withColumn("min", min("sal-name").over(window))
  .withColumn("max", max("sal-name").over(window))
  .select($"employee_name", $"department", $"min.*", $"max.*")
  .toDF("employee_name" ,"department", "min_sal", "min_sal_name", "max_sal", "min_sal_name")
  .show(false)

Output:

+-------------+----------+-------+------------+-------+------------+
|employee_name|department|min_sal|min_sal_name|max_sal|min_sal_name|
+-------------+----------+-------+------------+-------+------------+
|James        |Sales     |3000   |James       |4600   |Michael     |
|Michael      |Sales     |3000   |James       |4600   |Michael     |
|Robert       |Sales     |3000   |James       |4600   |Michael     |
|James        |Sales     |3000   |James       |4600   |Michael     |
|Saif         |Sales     |3000   |James       |4600   |Michael     |
|Maria        |Finance   |3000   |Maria       |3900   |Jen         |
|Scott        |Finance   |3000   |Maria       |3900   |Jen         |
|Jen          |Finance   |3000   |Maria       |3900   |Jen         |
|Jeff         |Marketing |2000   |Kumar       |3000   |Jeff        |
|Kumar        |Marketing |2000   |Kumar       |3000   |Jeff        |
+-------------+----------+-------+------------+-------+------------+

Spark Create new column containing min and max of corresponding values in certain other column

Answers (2)

Related Questions