user3495160

Reputation: 185

Column comparison in PySpark using DataFrames

I have a dataframe DF1 with three columns: city1, city2, distance. I want to create a new dataframe DF2 from DF1 by dropping one row from each pair of rows that are effectively the same.

I am checking the distance between cities. Since cities (A and B) and (B and A) have the same distance, I need to drop one of the two rows.

city1  city2  dist
A      B      100
A      C      200
B      A      100
C      B      200

So in this example, I need to drop either the 1st or the 3rd row because they are considered the same.

Expected output:

city1  city2  dist
A      B      100
A      C      200
C      B      200
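
For reference, a minimal sketch to reproduce the sample data (it assumes an active SparkSession named spark; this snippet is not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [("A", "B", 100), ("A", "C", 200), ("B", "A", 100), ("C", "B", 200)],
    ["city1", "city2", "dist"],
)
df.show()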

Upvotes: 0

Views: 51

Answers (1)

notNull

Reputation: 31510

Combine all the columns into a sorted array to build an order-insensitive key, then use the row_number function to keep only the first record in each window.

Example:

df.show()
#using some sample records from post
#+-----+-----+----+
#|city1|city2|dist|
#+-----+-----+----+
#|    A|    B| 100|
#|    A|    C| 200|
#|    B|    A| 100|
#+-----+-----+----+

from pyspark.sql.functions import array, array_join, array_sort, col, row_number, spark_partition_id
from pyspark.sql.window import Window

# build an order-insensitive key: sort the values (dist is implicitly cast
# to string inside the array) and join them into one string
df1=df.withColumn("new_cnct",array_join(array_sort(array(col("city1"),col("city2"),col("dist"))),'')).\
withColumn("s_id",spark_partition_id())

w=Window.partitionBy("new_cnct").orderBy("s_id")

# keep the first row per key, then drop the helper columns
df1.withColumn("rn",row_number().over(w)).\
filter(col("rn") == 1).\
drop("rn","s_id","new_cnct").\
show(10,False)
#+-----+-----+----+
#|city1|city2|dist|
#+-----+-----+----+
#|A    |B    |100 |
#|A    |C    |200 |
#+-----+-----+----+
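
A note on the design choice: the array key above folds dist into the key as well, which works here because mirrored pairs share the same distance. An equivalent route, not from the original answer, is to normalize just the city pair with least/greatest and deduplicate with dropDuplicates; a minimal sketch assuming the same df:

from pyspark.sql.functions import least, greatest

# put each pair into a canonical order, then drop duplicates on the normalized pair
dedup = df.withColumn("c_lo", least("city1", "city2")).\
withColumn("c_hi", greatest("city1", "city2")).\
dropDuplicates(["c_lo", "c_hi"]).\
drop("c_lo", "c_hi")
dedup.show()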

Upvotes: 1
