Pratik Rudra

Reputation: 47

How to merge dataframes keeping order in spark or Python

I am trying to merge two dataframes while keeping the order.

The first dataframe has these values:

>>> df_branch1.show(10,False)
+------------------------+
|col                     |
+------------------------+
|Sorter_SAMPLE_CUSTOMER  |
|Join_Source_Target      |
|Exp_DetectChanges       |
|Filter_Unchanged_Records|
|Router_UPDATE_INSERT    |
|Seq_Unique_Key          |
+------------------------+

The second dataframe has these values:

>>> df_branch2.show(10,False)
+------------------------+                                                      
|col                     |
+------------------------+
|Sorter_CUSTOMER_MASTER  |
|Join_Source_Target      |
|Exp_DetectChanges       |
|Filter_Unchanged_Records|
|Router_UPDATE_INSERT    |
|Seq_Unique_Key          |
+------------------------+

I want to merge the two dataframes and expect the original order to be preserved.

Expected output:

+------------------------+                                                      
|col                     |
+------------------------+
|Sorter_SAMPLE_CUSTOMER  |
|Sorter_CUSTOMER_MASTER  |
|Join_Source_Target      |
|Exp_DetectChanges       |
|Filter_Unchanged_Records|
|Router_UPDATE_INSERT    |
|Seq_Unique_Key          |
+------------------------+

Any solution in PySpark or Python will do.
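
For reference, a minimal sketch that reproduces the two input dataframes (assuming an existing SparkSession named spark; names follow the question):

from pyspark.sql.types import StringType

df_branch1 = spark.createDataFrame(
    ['Sorter_SAMPLE_CUSTOMER', 'Join_Source_Target', 'Exp_DetectChanges',
     'Filter_Unchanged_Records', 'Router_UPDATE_INSERT', 'Seq_Unique_Key'],
    StringType()).toDF('col')

df_branch2 = spark.createDataFrame(
    ['Sorter_CUSTOMER_MASTER', 'Join_Source_Target', 'Exp_DetectChanges',
     'Filter_Unchanged_Records', 'Router_UPDATE_INSERT', 'Seq_Unique_Key'],
    StringType()).toDF('col')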

Upvotes: 1

Views: 1398

Answers (2)

Ged

Reputation: 18003

This solution uses zipWithIndex; I'm not convinced by the monotonically_increasing_id approach. I have another solution as well, but as I'm pushed for time, here it is.

from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F

# Sample data; with an atomic StringType schema the single column is named "value"
df1 = spark.createDataFrame(['abc', '2', '3', '4'], StringType())
df2 = spark.createDataFrame(['abc', '2a', '3', '4'], StringType())

# Common schema; the index tagging could be factored into a def (a sketch follows the output), but pushed for time
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1 = df1.withColumn("t", F.lit(1))
rdd = df2.rdd.zipWithIndex()
rdd2 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df2 = spark.createDataFrame(rdd2, schema)
df2 = df2.withColumn("t", F.lit(2))

# Assumption: df1 always contains all the base values; the goal is to pick up the extras from df2 and position them directly after their counterparts
# Functional solution; performance may be an issue. Could also use collect_list etc., but using SQL here
# Does not consider the case where df1 has fewer values than df2

df1.createOrReplaceTempView("data1")
df2.createOrReplaceTempView("data2")

df3 = spark.sql('''select * from data2 d2  
                where exists   
                 (select d1.value from data1 d1
                   where d1.index = d2.index
                     and d1.value <> d2.value)
               ''')

dfRES = df1.union(df3).orderBy("index", "t").drop(*['index', 't'])
dfRES.show(truncate=False)

This returns the final DataFrame with the ordering preserved; no distinct is required:

+-----+
|value|
+-----+
|abc  |
|2    |
|2a   |
|3    |
|4    |
+-----+
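
The index tagging above is written out twice; as the code comment notes, it could be factored into a def. A minimal sketch of such a helper (the name with_position is my own, not from the answer):

from pyspark.sql.types import StructField, StructType, LongType
import pyspark.sql.functions as F

def with_position(df, tag):
    # append a positional "index" column via zipWithIndex, plus a literal source tag "t"
    schema = StructType(df.schema.fields[:] + [StructField("index", LongType(), True)])
    rdd = df.rdd.zipWithIndex().map(
        lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
    return spark.createDataFrame(rdd, schema).withColumn("t", F.lit(tag))

# df1 = with_position(df1, 1); df2 = with_position(df2, 2)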

UPD

Although the question is vague, this solution also caters for repeating values, if they exist, e.g.:

df1 = spark.createDataFrame(['abc', '2', '3', '4', 'abc', '2', '3', '4', 'abc', '2', '3', '4'], StringType())
df2 = spark.createDataFrame(['abc', '2a', '3', '4', 'abc', '2b', '3', '4', 'abc', '2c', '3c', '4'], StringType())

Upvotes: 1

YOLO

Reputation: 21709

Here's a way to do it using a key column:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# create a key column
d1 = d1.withColumn("key", F.monotonically_increasing_id())
d2 = d2.withColumn("key", F.monotonically_increasing_id())

# concat data
d3 = d1.union(d2)

# sort by key
d3 = d3.orderBy('key').drop('key')

# dedupe: keep only the first occurrence (by key) of each value
w = Window.partitionBy("col1").orderBy("key")
d4 = d3.withColumn("key", F.monotonically_increasing_id())
d4 = (d4
     .withColumn("dupe", F.row_number().over(w))
     .where("dupe == 1")
     .orderBy("key")
     .drop(*['key', 'dupe']))

d4.show()

+------------------------+
|col1                    |
+------------------------+
|Sorter_SAMPLE_CUSTOMER  |
|Sorter_CUSTOMER_MASTER  |
|Join_Source_Target      |
|Exp_DetectChanges       |
|Filter_Unchanged_Records|
|Router_UPDATE_INSERT    |
|Seq_Unique_Key          |
+------------------------+
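
Note: this assumes d1 and d2 are the question's df_branch1 and df_branch2 with the column named col1, and that the generated keys line up row-for-row across the two dataframes. monotonically_increasing_id only yields consecutive ids within a partition, so for small inputs a single partition keeps the keys aligned, e.g. a possible setup:

# assumption: the inputs are small enough to fit a single partition
d1 = df_branch1.toDF('col1').coalesce(1)
d2 = df_branch2.toDF('col1').coalesce(1)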

Upvotes: 1
