Reputation: 43
Let's assume I have a dataset that contains the following data:
data = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Smith','M',62),('Jake','Rose','M',21) ]
I now want to remove all rows that share the same last name and gender with another row (the first and third rows in the above dataset) using PySpark.
Thank you for your time 👍
Upvotes: 0
Views: 57
Reputation: 6998
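Count the rows per (last_name, gender) pair, then use a left anti join to drop every row whose pair occurs more than once:
from pyspark.sql.functions import col, count
# Assumes `data` is already a DataFrame with `last_name` and `gender` columns
with_duplicates = data.groupBy("last_name", "gender").agg(count("*").alias("count")).where(col("count") > 1)
without_duplicates = data.join(with_duplicates, ["last_name", "gender"], "left_anti")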
Upvotes: 0