Emma

Reputation: 41

PySpark dataframe: How to remove duplicate rows in a dataframe in Databricks

I have an existing dataframe in Databricks in which many rows are exactly the same in all column values. For example:

df:

No. Name Age Country
1 John 20 US
1 John 20 US
2 Cici 25 Japan
3 Tom 36 Canada
3 Tom 36 Canada
3 Tom 36 Canada

I want to end up with the following:

No. Name Age Country
1 John 20 US
2 Cici 25 Japan
3 Tom 36 Canada

How do I write the script? Thank you.

Upvotes: 1

Views: 508

Answers (1)

notNull

Reputation: 31490

Use either distinct() or dropDuplicates() on the dataframe. Both return a new dataframe with exact duplicate rows removed; the original dataframe is not modified.

Example:

df.distinct().show()

(or)

df.dropDuplicates().show()

Sample code:

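# `spark` is the SparkSession that Databricks notebooks provide.
df = spark.createDataFrame(
    [(1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (2, 'CICI', 25, 'Japan')],
    ['No.', 'Name', 'Age', 'country'],
)
df.distinct().show()        # drops rows that match in every column
df.dropDuplicates().show()  # same result when no column subset is passed
# Output (one table per show() call):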
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
#
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+

Upvotes: 0
