Oblivion
Oblivion

Reputation: 635

Scala Spark dataframe join result not in preferred order

I'm have a dataframe called stores_df that contains store information such as date and sales. I have another dataframe called avg_sales_store_by_month that contains the average sales for each month of each store. I wish to get the average sales column from this to append it to stores_df. The issue i have is after my join, the order of stores_df gets changed

Below is the first few rows from stores_df.

+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
|Store|      Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price|        CPI|Unemployment|Month|Year|Day|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
|    1|2010-02-05|    FALSE|   1|       24924|      42.31|     2.572|211.0963582|       8.106|    2|2010|  5|
|    1|2010-02-12|     TRUE|   1|       46039|      38.51|     2.548|211.2421698|       8.106|    2|2010| 12|
|    1|2010-02-19|    FALSE|   1|       41595|      39.93|     2.514|211.2891429|       8.106|    2|2010| 19|
|    1|2010-05-14|    FALSE|   1|       18926|      74.78|     2.854|210.3374261|       7.808|    5|2010| 14|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+

Below is the first few rows of avg_sales_store_by_month, i wish to grab the last column and append it to the end of stores_df.

+-----+-----+------------------+
|Store|Month|avg_sales_by_month|
+-----+-----+------------------+
|   39|   11|          23317.75|
|   43|    7|          13090.84|
|   10|    2|          28407.05|
|   23|    6|           21265.7|
|    4|   10|           28723.2|
|    9|   10|            8468.2|
+-----+-----+------------------+

My issue is when i use my join:

stores_df = stores_df.join( avg_sales_store_by_month, Seq("Store", "Month"), "left" )

The rows of stores_df gets reordered, I would like it to be in the same order as before the join but with the extra column. How do i achieve this?

After the join snippet, order is messed up.

+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
|Store|Month|      Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price|        CPI|Unemployment|Year|Day|avg_sales_by_month|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
|   39|   11|2010-11-05|    FALSE|   1|       31729|      61.62|     2.689|210.7202444|       8.476|2010|  5|          23317.75|
|   39|   11|2010-11-12|    FALSE|   1|       12324|      62.21|     2.728|210.7667944|       8.476|2010| 12|          23317.75|
|   39|   11|2010-11-19|    FALSE|   1|       15137|       55.5|     2.771|  210.65429|       8.476|2010| 19|          23317.75|
|   39|   11|2011-11-11|    FALSE|   2|       65758|      63.11|     3.297|216.7217373|       7.716|2011| 11|          23317.75|
|   39|   11|2011-11-18|    FALSE|   2|       70050|      66.09|     3.308|216.9395861|       7.716|2011| 18|          23317.75|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+

Upvotes: 0

Views: 3393

Answers (1)

Leo C
Leo C

Reputation: 22439

If you want to preserve the original column order, you can save the first dataframe's columns along with the additional column in an Array and select them after the join, as in the following example:

val df1 = Seq(
  (1, 25000, 3, 2010, 20),
  (1, 30000, 3, 2010, 27),
  (1, 20000, 4, 2010, 3),
  (2, 40000, 3, 2010, 20),
  (2, 35000, 3, 2010, 27),
  (2, 35000, 4, 2010, 3)
).toDF("Store", "Wk_Sales", "Month", "year", "Day")

val df2 = Seq(
  (1, 3, 100000),
  (1, 4, 90000),
  (2, 3, 140000),
  (2, 4, 110000)
).toDF("Store", "Month", "Mo_Sales")

val joinedDF = df1.join(df2, Seq("Store", "Month"), "left")
// +-----+-----+--------+----+---+--------+
// |Store|Month|Wk_Sales|year|Day|Mo_Sales|
// +-----+-----+--------+----+---+--------+
// |    1|    3|   25000|2010| 20|  100000|
// |    1|    3|   30000|2010| 27|  100000|
// |    1|    4|   20000|2010|  3|   90000|
// |    2|    3|   40000|2010| 20|  140000|
// |    2|    3|   35000|2010| 27|  140000|
// |    2|    4|   35000|2010|  3|  110000|
// +-----+-----+--------+----+---+--------+

val cols = df1.columns :+ "Mo_Sales"

joinedDF.select(cols.head, cols.tail: _*).
  show
// +-----+--------+-----+----+---+--------+
// |Store|Wk_Sales|Month|year|Day|Mo_Sales|
// +-----+--------+-----+----+---+--------+
// |    1|   25000|    3|2010| 20|  100000|
// |    1|   30000|    3|2010| 27|  100000|
// |    1|   20000|    4|2010|  3|   90000|
// |    2|   40000|    3|2010| 20|  140000|
// |    2|   35000|    3|2010| 27|  140000|
// |    2|   35000|    4|2010|  3|  110000|
// +-----+--------+-----+----+---+--------+

Upvotes: 2

Related Questions