PySpark DataFrame: How to remove duplicate rows in a DataFrame in Databricks

Question

I have an existing DataFrame in Databricks that contains many rows that are exactly the same in all column values. For example:

df:

No.	Name	Age	Country
1	John	20	US
1	John	20	US
2	Cici	25	Japan
3	Tom	36	Canada
3	Tom	36	Canada
3	Tom	36	Canada

I want to end up with the following:

No.	Name	Age	Country
1	John	20	US
2	Cici	25	Japan
3	Tom	36	Canada

How do I write the script? Thank you.

notNull · Accepted Answer

Use either distinct() or dropDuplicates() on the DataFrame. Called with no arguments, both return a new DataFrame with exact duplicate rows removed.

Example:

df.distinct().show()

(or)

df.dropDuplicates().show()

Sample code:

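# `spark` is the SparkSession that Databricks notebooks provide by default.
df = spark.createDataFrame(
    [(1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (2, 'CICI', 25, 'Japan')],
    ['No.', 'Name', 'Age', 'country'],
)
df.distinct().show()         # removes rows that are identical in every column
df.dropDuplicates().show()   # same result when no column subset is given
# Output (one table per call; row order is not guaranteed):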
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
#
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
