Reputation: 2580
I have a DataFrame with two columns, id1 and id2, and I'd like to count the number of distinct values across the two columns combined. Essentially this is count(set(id1 + id2)).
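For concreteness, a made-up toy example of what I mean:

# hypothetical sample data; `spark` is an active SparkSession
df = spark.createDataFrame([('a', 'b'), ('b', 'c'), ('a', 'c')], ['id1', 'id2'])
# the distinct values across both columns are {'a', 'b', 'c'}, so the count should be 3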
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate, as I'd like PySpark to calculate the count(). Of course it's possible to get the two lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
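For reference, the driver-side version I'd like to avoid would be something like this (it collects all distinct values onto the driver, which won't scale):

# collect each column's distinct values to the driver, then combine in plain Python
id1_distinct = [row['id1'] for row in df.select('id1').distinct().collect()]
id2_distinct = [row['id2'] for row in df.select('id2').distinct().collect()]
cnt = len(set(id1_distinct + id2_distinct))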
Upvotes: 0
Views: 573
Reputation: 42422
You can combine the two columns into one using union, and then run countDistinct on the result:
import pyspark.sql.functions as F
# union stacks id2 under id1; the combined column keeps the first
# DataFrame's column name ('id1'), so that's what countDistinct counts
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
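If counting under the first column's name feels surprising, an equivalent sketch (assuming the same df) is to alias both columns to a common name before stacking, then count the distinct rows of the single-column result:

import pyspark.sql.functions as F
# give both columns the same name, stack them, deduplicate, and count
cnt = (
    df.select(F.col('id1').alias('id'))
      .union(df.select(F.col('id2').alias('id')))
      .distinct()
      .count()
)

One caveat: countDistinct ignores nulls, while distinct().count() counts a null as one extra value, so the two can differ if the columns contain nulls.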
Upvotes: 1