Randomize

Reputation: 9103

What is the equivalent of Spark Dataframe's dropDuplicates in Spark SQL?

I have some Spark (3.0/3.1) code written like this:

foo.join(bar, Seq("col1","col2","col3"),"inner").dropDuplicates("col1","col2")

where foo and bar are two generic DataFrames.

How does this translate to Spark SQL? I cannot find an equivalent of dropDuplicates, since

select distinct(col1, col2), * ....

seems to do something different. Any idea?

Upvotes: 1

Views: 517

Answers (1)

Mohana B C

Reputation: 5487

You can use the ranking function row_number().

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

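// Sample data: the first two rows share the same (col1, col2) pair.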
val df = List((1, 2, 3), (1, 2, 4), (1, 4, 6)).toDF("col1", "col2", "col3")

df.dropDuplicates("col1", "col2").show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
|   1|   4|   6|
+----+----+----+*/

df.createOrReplaceTempView("table")
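// Note: row_number() requires an ordered window. Ordering by the partition
// keys themselves is a no-op here, so which row gets rn = 1 is arbitrary,
// matching dropDuplicates, which also keeps an arbitrary row per group.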
spark.sql(
    """
      |select col1, col2, col3 from (
      |   select *, row_number() over (partition by col1, col2 order by col1, col2) as rn from table)
      |   where rn = 1
      |""".stripMargin).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
|   1|   4|   6|
+----+----+----+*/
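
As for why select distinct(col1, col2), * behaves differently: DISTINCT in Spark SQL applies to the whole select list, and (col1, col2) is parsed as a struct-valued row constructor, so that query deduplicates over every column plus an extra struct column instead of over just col1 and col2.

If keeping an arbitrary row per group is acceptable (which is what dropDuplicates does anyway), a plain group-by is a shorter sketch; first() is Spark's non-deterministic "any value" aggregate:

spark.sql(
    """
      |select col1, col2, first(col3) as col3
      |from table
      |group by col1, col2
      |""".stripMargin).show()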

Upvotes: 2
