Reputation: 85
I'm brand new to PySpark (and really to Python as well). I'm trying to count the distinct values in each column (not distinct combinations of columns). I want the answer to this SQL statement:
sqlStatement = "Select Count(Distinct C1) AS C1, Count(Distinct C2) AS C2, ..., Count(Distinct CN) AS CN From myTable"
distinct_count = spark.sql(sqlStatement).collect()
That takes forever (16 hours) on an 8-node cluster (see configuration below). The dataset is 100 GB with 400 columns. I don't see a way to express this with DataFrame primitives like:
df.agg(countDistinct('C1', 'C2', ..., 'CN'))
as that will give me the distinct count of the (C1, ..., CN) combinations, not a count per column. There must be a way to make this fast.
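To illustrate the difference with made-up column names (just a sketch of what I mean):
from pyspark.sql.functions import countDistinct
# counts distinct (C1, C2) pairs -- not what I want
df.agg(countDistinct('C1', 'C2')).show()
# one distinct count per column -- what I actually want, though I don't know if it runs any faster
df.agg(countDistinct('C1').alias('C1'), countDistinct('C2').alias('C2')).show()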
Master node: Standard (1 master, N workers)
Machine type: n1-highmem-8 (8 vCPU, 52.0 GB memory)
Primary disk size: 500 GB

Worker nodes: 8
Machine type: n1-highmem-4 (4 vCPU, 26.0 GB memory)
Primary disk size: 500 GB
Local SSDs: 1
Upvotes: 1
Views: 4049
Reputation: 99
Note that you are using the .collect() method, which returns all elements of the dataset to the driver; this may cause the driver to run out of memory. See this link for an explanation.
You can see the plan Spark will execute by running .explain() on your query:
myquery = spark.sql(sqlStatement)
myquery.explain()  # prints the physical plan Spark will run
You could ease the problem by splitting your query into several smaller queries so that you do not compute the distinct count for every column at once. This reduces the amount of data processed in each pass.
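As a rough sketch (the table name "myTable", the batch size, and collecting one Row per batch are assumptions on my part), you could loop over the columns in batches and merge the partial results:
from pyspark.sql.functions import countDistinct

df = spark.table("myTable")   # assumption: your data is available as the table "myTable"
batch_size = 50               # assumption: tune this to your cluster
distinct_counts = {}
for i in range(0, len(df.columns), batch_size):
    batch = df.columns[i:i + batch_size]
    # one countDistinct per column in this batch, each aliased back to its column name
    row = df.agg(*[countDistinct(c).alias(c) for c in batch]).collect()[0]
    distinct_counts.update(row.asDict())
print(distinct_counts)
Each .collect() here only brings back a single row of counts, so the result itself stays small on the driver.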
Upvotes: 1