Reputation: 185
I have a dataframe DF1 with three columns: city1, city2, distance. I want to create a new dataframe DF2 from DF1 by dropping the rows that are duplicates.
I am checking the distance between cities. Since (A and B) or (B and A) will have the same distance, I need to drop one of them.
city1 city2 dist
A B 100
A C 200
B A 100
C B 200
So here I need to drop either the 1st or the 3rd row, because they are considered the same.
Expected output:
city1 city2 dist
A B 100
A C 200
C B 200
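The deduplication logic can be sketched in plain Python (a minimal illustration of the requirement, independent of any Spark API): keep the first row seen for each unordered city pair.

```python
# Keep the first row seen for each unordered (city1, city2) pair.
rows = [("A", "B", 100), ("A", "C", 200), ("B", "A", 100), ("C", "B", 200)]
seen = {}
for c1, c2, d in rows:
    key = tuple(sorted((c1, c2)))  # (A, B) and (B, A) map to the same key
    seen.setdefault(key, (c1, c2, d))
deduped = list(seen.values())
# → [('A', 'B', 100), ('A', 'C', 200), ('C', 'B', 200)]
```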
Upvotes: 0
Views: 51
Reputation: 31510
Sort an array built from all the columns to create a canonical key, then use the row_number window function to keep only the first record in each window.
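Why sorting makes a good key: treating all values as strings, both orderings of the same pair sort to an identical sequence. A plain-Python sketch of the idea (Spark's array_sort on a string array behaves analogously):

```python
# Both row orderings produce the same canonical key once sorted and joined.
key_ab = ''.join(sorted(["A", "B", "100"]))  # from row (A, B, 100)
key_ba = ''.join(sorted(["B", "A", "100"]))  # from row (B, A, 100)
# "100" sorts before "A" and "B" lexicographically, so both keys are "100AB"
```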
Example:
df.show()
#using some sample records from post
#+-----+-----+----+
#|city1|city2|dist|
#+-----+-----+----+
#| A| B| 100|
#| A| C| 200|
#| B| A| 100|
#+-----+-----+----+
from pyspark.sql.functions import array, array_sort, array_join, col, row_number, spark_partition_id
from pyspark.sql.window import Window

# build a canonical key: sorting the values makes (A, B) and (B, A) identical;
# dist is cast to string because array() needs elements of a single type
df1=df.withColumn("new_cnct",array_join(array_sort(array(col("city1"),col("city2"),col("dist").cast("string"))),'')).\
withColumn("s_id",spark_partition_id())

w=Window.partitionBy("new_cnct").orderBy("s_id")

df1.withColumn("rn",row_number().over(w)).\
filter(col("rn") ==1).\
drop("rn","s_id","new_cnct").\
show(10,False)
#+-----+-----+----+
#|city1|city2|dist|
#+-----+-----+----+
#|A |B |100 |
#|A |C |200 |
#+-----+-----+----+
Upvotes: 1