Arun Vijay

Reputation: 43

Removing rows based on a condition in PySpark

Let's assume I have a dataset that contains the following data:

data = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Smith','M',62),('Jake','Rose','M',21) ]

I now want to remove every row that shares its last name and gender with another row (the first and third rows in the dataset above) using PySpark.

Thank you for your time 👍

Upvotes: 0

Views: 57

Answers (1)

Robert Kossendey

Reputation: 6998

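from pyspark.sql.functions import col, count

# Assumes `data` is already a DataFrame with "last_name" and "gender" columns, not the raw list of tuples.
with_duplicates = data.groupBy("last_name", "gender").agg(count("*").alias("count")).where(col("count") > 1)

without_duplicates = data.join(with_duplicates, ["last_name", "gender"], "left_anti")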
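
The two steps (count each last name and gender pair, then drop every row whose pair occurs more than once) can be sanity-checked without a Spark session. The following is only a plain-Python sketch of the same logic on the question's data, not PySpark code. The names `rows`, `pair_counts` and `kept` are made up for illustration, and the tuple layout (first name, last name, gender, age) is an assumption read off the sample.

```python
from collections import Counter

# Rows from the question; assumed layout: (first_name, last_name, gender, age)
rows = [('James', 'Smith', 'M', 30), ('Anna', 'Rose', 'F', 41),
        ('Robert', 'Smith', 'M', 62), ('Jake', 'Rose', 'M', 21)]

# Step 1 (mirrors groupBy + count): occurrences of each (last_name, gender) pair
pair_counts = Counter((last, gender) for _, last, gender, _ in rows)

# Step 2 (mirrors the left_anti join): keep rows whose pair occurs exactly once
kept = [row for row in rows if pair_counts[(row[1], row[2])] == 1]

print(kept)
```

Only the two Rose rows remain, because they differ in gender, while both Smith rows share the pair ('Smith', 'M') and are dropped.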

Upvotes: 0
