Reputation: 41
I have an existing DataFrame in Databricks that contains many rows that are exactly the same in every column. For example:
df:
No. | Name | Age | Country |
---|---|---|---|
1 | John | 20 | US |
1 | John | 20 | US |
2 | Cici | 25 | Japan |
3 | Tom | 36 | Canada |
3 | Tom | 36 | Canada |
3 | Tom | 36 | Canada |
I want to end up with the following:
No. | Name | Age | Country |
---|---|---|---|
1 | John | 20 | US |
2 | Cici | 25 | Japan |
3 | Tom | 36 | Canada |
How do I write a script to do this? Thank you.
Upvotes: 1
Views: 508
Reputation: 31490
Use either the distinct() or the dropDuplicates() function on the DataFrame. Called with no arguments, both remove rows that are identical in every column.
Example:
df.distinct().show()
(or)
df.dropDuplicates().show()
Sample code:
df = spark.createDataFrame([(1,'John',20,'US'),(1,'John',20,'US'),(1,'John',20,'US'),(2,'CICI',25,'Japan')],['No.','Name','Age','country'])
df.distinct().show()
df.dropDuplicates().show()
#output
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
#
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
Upvotes: 0