Reputation: 9103
I have some code in Spark (3.0/3.1) written this way:
foo.join(bar, Seq("col1","col2","col3"),"inner").dropDuplicates("col1","col2")
where foo and bar are two generic DataFrames.
How does this translate to Spark SQL? I cannot find an equivalent definition for dropDuplicates, as:
select distinct(col1, col2), * ....
seems to be something different. Any ideas?
Upvotes: 1
Views: 517
Reputation: 5487
You can use the ranking function row_number().
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
// Sample data: (1, 2) appears twice in (col1, col2)
val df = List((1, 2, 3), (1, 2, 4), (1, 4, 6)).toDF("col1", "col2", "col3")
df.dropDuplicates("col1", "col2").show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 1| 4| 6|
+----+----+----+*/
df.createOrReplaceTempView("table")
spark.sql(
"""
|select col1, col2, col3 from (
| select *, row_number() over (partition by col1, col2 order by col1, col2) as rn from table)
| where rn = 1
|""".stripMargin).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 1| 4| 6|
+----+----+----+*/
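One caveat: like dropDuplicates, the row_number() query above is nondeterministic about which duplicate survives, because ordering by the partition columns themselves (col1, col2) gives every row in a partition the same sort key. If you need a deterministic result, order by a tie-breaking column instead. A sketch, assuming you want to keep the row with the smallest col3 per (col1, col2) group:

```sql
select col1, col2, col3 from (
  -- rank rows within each (col1, col2) group by col3 ascending
  select *,
         row_number() over (partition by col1, col2 order by col3) as rn
  from table)
where rn = 1
```

With the sample data this always keeps (1, 2, 3) rather than sometimes returning (1, 2, 4).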
Upvotes: 2