Ostap Strashevskii

Reputation: 141

How to remove duplicates from a DataFrame in Spark based on particular columns?

How do I get a DataFrame that contains no duplicated rows, where "uniqueness" is determined by particular columns?

For example, we have this DataFrame:

1 2 3 4
2 3 4 5
2 3 4 4 

We need to remove the rows whose first three fields are equal. We see that 2 3 4 = 2 3 4, so the resulting DataFrame would be:

1 2 3 4

Upvotes: 1

Views: 288

Answers (2)

yogesh garud

Reputation: 336

Adding an alternative approach to Mahesh's answer:

spark.sql("""
select 1 as t1,2 as t2,3 as t3,4 as t4 union all
select 2 as t1,3 as t2,4 as t3,5 as t4  union all
select 2 as t1,3 as t2,4 as t3,4 as t4 
""").createOrReplaceTempView("table1")
spark.sql("""
select * from table1
where concat(t1,t2,t3) not in (
    select concat(t1,t2,t3) as c1 from table1 group by c1 having count(*) > 1
)
""").show()

output -

+---+---+---+---+
| t1| t2| t3| t4|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+

Upvotes: 1

Mahesh Gupta

Reputation: 1892

I am using Spark with a left anti join to achieve the desired output.

from pyspark.sql import functions as F

data = [[1,2,3,4],[2,3,4,5],[2,3,4,4]]
schema = ['id','id1','id2','id3']
df = spark.createDataFrame(data,schema)

# concatenate the fields on which we need to detect duplicate records
df2 = df.withColumn("concat",F.concat_ws('',F.col("id"),F.col("id1"),F.col("id2")))

# createOrReplaceTempView replaces the deprecated registerTempTable
df2.createOrReplaceTempView("temp")
result = spark.sql("select concat, count(1) from temp group by concat having count(1)>1")

df2.join(result, result.concat==df2.concat,how="left_anti").drop("concat").show()

+---+---+---+---+
| id|id1|id2|id3|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+

Let me know if you need further clarification on it.

Upvotes: 4
