pali56

Reputation: 79

pyspark: get unique items in each column of a dataframe

I have a Spark dataframe with 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe. I have written the following code to achieve this, but it is getting stuck and taking too long to execute:

count_unique_items = []

# One separate distinct-count job is launched per column
for var in cat_col:
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())

cat_col contains the column names of all the categorical variables
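For reference, a minimal sketch of how such a list might be built, assuming the categorical variables are simply the string-typed columns (that choice is an assumption for illustration):

# Assumption: the categorical variables are the string-typed columns of the dataframe
cat_col = [name for name, dtype in data.dtypes if dtype == "string"]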

Is there any way to optimize this?

Upvotes: 0

Views: 3512

Answers (3)

Luis A.G.

Reputation: 1097

You can get every distinct element of each column with

df.stat.freqItems([list of column names], [support, i.e. the minimum frequency (default = 1%)])
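For instance, a minimal usage sketch (the column names and support value here are placeholders, not taken from the question):

# Result is a single-row dataframe with one array column per input column,
# named like col1_freqItems; support=0.1 keeps values seen in >= 10% of rows
freq = df.stat.freqItems(["col1", "col2"], support=0.1)
freq.show(truncate=False)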

This returns a single-row dataframe with the frequent values of each column, but if you want a dataframe with just the distinct count of each column, use this:

from pyspark.sql.functions import countDistinct

df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()

The count part was taken from here: check number of unique values in each column of a matrix in spark

Upvotes: 0

Randy Zwitch

Reputation: 2064

You can do something like this, but as stated above, distinct element counting is expensive. The single * unpacks the generator so that each per-column expression is passed as a separate argument, and the return value is 1 row x N columns. I frequently add a .toPandas() call to make the result easier to manipulate later down the road.

from pyspark.sql.functions import col, approxCountDistinct

# Approximate distinct count for every column, computed in a single aggregation
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
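Following the .toPandas() suggestion, a small sketch of how the one-row result might be reshaped (the dict conversion is just one option):

# Collect the single-row result to the driver and turn it into a
# {column name: approximate distinct count} mapping via pandas
pdf = distvals.toPandas()
approx_counts = pdf.iloc[0].to_dict()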

Upvotes: 0

user6022341

Reputation:

Try using approxCountDistinct or countDistinct:

from pyspark.sql.functions import approxCountDistinct, countDistinct

counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()

Keep in mind, though, that counting distinct elements is expensive.
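If the numbers are needed back in Python, one variation is to alias each aggregate and convert the returned Row to a dict (a sketch; the column names are placeholders):

from pyspark.sql.functions import approxCountDistinct

# Aliasing each aggregate makes the keys of the returned Row predictable
counts = df.agg(
    approxCountDistinct("col1").alias("col1"),
    approxCountDistinct("col2").alias("col2"),
).first()
print(counts.asDict())  # {'col1': <approx count>, 'col2': <approx count>}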

Upvotes: 1
