timbram

Reputation: 1865

pyspark - how can I remove all duplicate rows (ignoring certain columns) without leaving any dupe pairs behind?

I have the following table:

df = spark.createDataFrame([(2,'john',1,'131434234342'),
                            (2,'john',1,'10-22-2018'),
                            (3,'pete',8,'10-22-2018'),
                            (3,'pete',8,'3258958304'),
                            (5,'steve',9,'124324234')],
                           ['id','name','value','date'])

df.show()

+----+-------+-------+--------------+
| id | name  | value |     date     |
+----+-------+-------+--------------+
|  2 | john  |     1 | 131434234342 |
|  2 | john  |     1 | 10-22-2018   |
|  3 | pete  |     8 | 10-22-2018   |
|  3 | pete  |     8 | 3258958304   |
|  5 | steve |     9 | 124324234    |
+----+-------+-------+--------------+

I want to remove all rows that are duplicated across id, name, and value (but NOT date), keeping neither member of a dupe pair, so that I end up with:

+----+-------+-------+-----------+
| id | name  | value |   date    |
+----+-------+-------+-----------+
|  5 | steve |     9 | 124324234 |
+----+-------+-------+-----------+

How can I do this in PySpark?

Upvotes: 1

Views: 2570

Answers (2)

User12345

Reputation: 5480

Do a groupBy on the columns you want, count the rows in each group, filter to the groups where the count equals 1, and then drop the count column, like below:

import pyspark.sql.functions as f

# keep only the (id, name, value) groups that occur exactly once
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')

You can also add the date column to the groupBy if you want it to count toward the duplicate check.
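Note that the groupBy above drops the date column from the result. As a minimal sketch of one way to keep it (assuming the original DataFrame is still available under a name like df_orig, which is hypothetical here), you can filter down to the unique keys and join them back:

import pyspark.sql.functions as f

# df_orig is assumed to be the original DataFrame, date column included
keys = (df_orig.groupBy('id', 'name', 'value')
               .agg(f.count('*').alias('cnt'))
               .where('cnt = 1')
               .drop('cnt'))

# the inner join restores the date column for the surviving rows
result = df_orig.join(keys, ['id', 'name', 'value'], 'inner')
result.show()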

Hope this helps you

Upvotes: 1

plalanne

Reputation: 1030

You could groupBy id, name and value and filter on the count column:

df = df.groupBy('id','name','value').count().where('count = 1')

df.show()

+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
|  5|steve|    9|    1|
+---+-----+-----+-----+

You could then drop the count column if needed.
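For instance, chaining the drop onto the same expression:

# same groupBy/count/filter as above, with the count column dropped at the end
df = df.groupBy('id','name','value').count().where('count = 1').drop('count')

df.show()

+---+-----+-----+
| id| name|value|
+---+-----+-----+
|  5|steve|    9|
+---+-----+-----+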

Upvotes: 2
