TinaK

Reputation: 41

Pyspark: Get the amount of distinct combinations between two columns

I need to be able to get the number of distinct combinations in two separate columns.

In this example, from the "Animal" and "Color" columns, the result I want is 3, since three distinct combinations of the columns occur. Animal or Color on its own can repeat across rows, but two rows with the same Animal AND Color should only be counted once.

Animal | Color
Dog    | Brown
Dog    | White
Cat    | Black
Dog    | White

I know you can add data to a set to eliminate duplicates, but I couldn't get that to work with multiple variables.
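For reference, the set approach does handle multiple variables if each row is stored as a tuple, so the pair is deduplicated as a unit. A minimal plain-Python sketch on the sample data:

```python
# Each (Animal, Color) pair becomes one tuple; the set drops exact duplicate pairs.
rows = [("Dog", "Brown"), ("Dog", "White"), ("Cat", "Black"), ("Dog", "White")]
distinct_pairs = {(animal, color) for animal, color in rows}
print(len(distinct_pairs))  # 3
```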

Here is the example code I was given to attempt to solve this.

d = d.rdd
d = d.map(lambda row: (row.day.year, row.number))
print(d.take(2000))
d_maxNum = d.reduceByKey(lambda max_num, this_num: this_num if this_num > max_num else max_num)
print(d_maxNum.collect())

Upvotes: 2

Views: 4467

Answers (2)

Shantanu Sharma

Reputation: 4099

You can use the distinct() function.

# Perform distinct on the entire dataframe.
df.distinct().show()

# Perform distinct on specific columns of the dataframe.
df.select('Animal', 'Color').distinct().show()

Upvotes: 1

wypul

Reputation: 837

PySpark has a dropDuplicates method that you can use.

from pyspark.sql import Row

df = sc.parallelize([Row(Animal='Dog', Color='White'),
                     Row(Animal='Dog', Color='Black'),
                     Row(Animal='Dog', Color='White'),
                     Row(Animal='Cat', Color='White')]).toDF()

df.dropDuplicates(['Animal', 'Color']).count()

which will output 3.

Upvotes: 3
