Reputation: 41
I'm working in Azure Databricks with PySpark. There is an existing dataframe like the one below:
df:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 25 | Japan |
2 | Lily | 22 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Tom | 28 | Austria |
5 | Cindy | 28 | Italy |
column "id" is the primary key and cannot have duplicated values, which means all the numbers should appear only once, but now as you could see, number 2 and 5 have two rows with different values. I need to keep only one row for id 2 and one row for id 5 only. We can drop randomly and keep the other row, or we can keep only the upper row if there are duplicated rows
I need to have the following:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 25 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Tom | 28 | Austria |
or the following is also fine:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 22 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Cindy | 28 | Italy |
How should I write the scripts?
Thanks.
Upvotes: 0
Views: 315
Reputation: 6644
You could simply use dropDuplicates() with id as the subset column. See the example below.
# given the following input
# +---+---+---+---+
# | id| c1| c2| c3|
# +---+---+---+---+
# | 1|foo|bar|baz|
# | 1|foo|baz|bar|
# | 2|foo|bar|baz|
# | 2|foo|bar|baz|
# +---+---+---+---+
data_sdf. \
    dropDuplicates(subset=['id']). \
    show()
# +---+---+---+---+
# | id| c1| c2| c3|
# +---+---+---+---+
# | 1|foo|bar|baz|
# | 2|foo|bar|baz|
# +---+---+---+---+
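Applied to the dataframe from the question (assuming it is bound to a variable named df), a minimal sketch would be:
# keep exactly one row per id; which of the duplicate rows survives is arbitrary
df_dedup = df.dropDuplicates(subset=['id'])
df_dedup.show()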
Upvotes: 0
Reputation: 7136
One approach to remove duplicate records is to add a row number column to the dataframe for each id
group and then keep only the first row for each group.
Code:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define the schema for the dataframe
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Country", StringType(), True)
])
# Create the dataframe
data = [
    (1, "Mike", 18, "USA"),
    (2, "Lily", 25, "Japan"),
    (2, "Lily", 22, "Japan"),
    (3, "John", 23, "China"),
    (4, "Snow", 36, "Korea"),
    (5, "Tom", 28, "Austria"),
    (5, "Cindy", 28, "Italy")
]
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
# Define a window function to partition by id and order by any column
window = Window.partitionBy("id").orderBy("Name")
# Add a row number column to the dataframe
df = df.withColumn("row_number", row_number().over(window))
# Keep only the first row for each id
df = df.filter(df.row_number == 1).drop("row_number")
# Show the resulting dataframe
df.show()
Window.partitionBy("id").orderBy("Name"): This defines a window that partitions the dataframe by the id column and orders the rows within each partition by the Name column.
df.withColumn("row_number", row_number().over(window)): This adds a new column called row_number, containing each row's position within the partition defined by the window.
df.filter(df.row_number == 1).drop("row_number"): This keeps only the rows where row_number equals 1, i.e. the first row within each partition, and then drops the row_number column from the result.
Output:
id | Name | Age | Country |
---|---|---|---|
1 | Mike | 18 | USA |
2 | Lily | 25 | Japan |
3 | John | 23 | China |
4 | Snow | 36 | Korea |
5 | Cindy | 28 | Italy |
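Note that ordering by Name keeps whichever duplicate comes first alphabetically, not the upper row. If you specifically want the upper row as it appears in the dataframe, a rough sketch is to order the window by a generated id instead (this assumes monotonically_increasing_id reflects the original row order, which holds for a small example like this but is not strictly guaranteed in general):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id
# df is the original dataframe, before any row_number column was added
df_tagged = df.withColumn("orig_order", monotonically_increasing_id())
# order each id group by the generated id so the upper row comes first
window = Window.partitionBy("id").orderBy("orig_order")
df_dedup = df_tagged.withColumn("row_number", row_number().over(window)) \
    .filter("row_number = 1") \
    .drop("row_number", "orig_order")
df_dedup.show()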
Upvotes: 0