Reputation: 2580
I have a DataFrame with two columns, id1 and id2, and I'd like to count the number of distinct values across the two columns combined. Essentially this is count(set(id1 + id2)).
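For concreteness, a made-up toy example of what I mean:

# hypothetical sample data; `spark` is an active SparkSession
df = spark.createDataFrame([('a', 'b'), ('b', 'c'), ('a', 'c')], ['id1', 'id2'])
# the distinct values across both columns are {'a', 'b', 'c'}, so the count should be 3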
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate, as I'd like PySpark to calculate the count(). Of course it's possible to get the two lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
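For reference, the driver-side version I'd like to avoid would be something like this (it collects all distinct values onto the driver, which won't scale):

# collect each column's distinct values to the driver, then combine in plain Python
id1_distinct = [row['id1'] for row in df.select('id1').distinct().collect()]
id2_distinct = [row['id2'] for row in df.select('id2').distinct().collect()]
cnt = len(set(id1_distinct + id2_distinct))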
Upvotes: 0
Views: 573
Reputation: 42422
You can combine the two columns into one using union, and then run countDistinct on the result:
import pyspark.sql.functions as F
# union stacks id2 under id1; the combined column keeps the first
# DataFrame's column name ('id1'), so that's what countDistinct counts
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
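If counting under the first column's name feels surprising, an equivalent sketch (assuming the same df) is to alias both columns to a common name before stacking, then count the distinct rows of the single-column result:

import pyspark.sql.functions as F
# give both columns the same name, stack them, deduplicate, and count
cnt = (
    df.select(F.col('id1').alias('id'))
      .union(df.select(F.col('id2').alias('id')))
      .distinct()
      .count()
)

One caveat: countDistinct ignores nulls, while distinct().count() counts a null as one extra value, so the two can differ if the columns contain nulls.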
Upvotes: 1