Reputation: 75
I have a dataframe where every cell is a letter from A, B, C, or D. It is a very large dataframe. What's the most efficient way to come up with a count of the number of times each letter appears across the entire dataframe?
A very small example:
col1 col2 col3 col4
A    C    A    C
B    D    C    A
I want a count that looks like:
col count
A   3
B   1
C   3
D   1
Does not have to be sorted.
For example, I was thinking of combining the columns into one and calling groupBy and count on that:
from pyspark.sql.functions import array, col, explode

one_column_df = df.withColumn("mycol", array("col1", "col2", "col3", "col4")).select(explode(col("mycol")))
one_column_df.groupBy("col").count().show()
Is there a more efficient way than having to call array and explode?
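For comparison, the same melt-then-count idea can be written with stack() so that no intermediate array column is built; this is just a sketch and I have not benchmarked it against the array/explode version:
# stack(4, ...) turns each input row into 4 single-letter rows (one per column)
melted = df.selectExpr("stack(4, col1, col2, col3, col4) as (col)")
melted.groupBy("col").count().show()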
Upvotes: 1
Views: 361
Reputation: 427
Hope this helps: count each column separately in its own CTE, full outer join the per-letter counts, coalesce the letter across the joins, add the four counts (treating missing ones as 0), and finally group by the letter to get the totals:
data = [
    ('A', 'C', 'A', 'C'),
    ('B', 'D', 'C', 'A'),
]
df = spark.createDataFrame(data, ["col1", "col2", "col3", "col4"])
df.createOrReplaceTempView("parquetFile")
spark.sql("""
    with as1 as (select count(col1) c1, col1 from parquetFile group by col1),
         as2 as (select count(col2) c2, col2 from parquetFile group by col2),
         as3 as (select count(col3) c3, col3 from parquetFile group by col3),
         as4 as (select count(col4) c4, col4 from parquetFile group by col4),
         as5 as (
             select coalesce(col1, col2, col3, col4) col1,
                    (case when c1 is null then 0 else c1 end
                     + case when c2 is null then 0 else c2 end
                     + case when c3 is null then 0 else c3 end
                     + case when c4 is null then 0 else c4 end) count1
             from as1
             full outer join as2 on as1.col1 = as2.col2
             full outer join as3 on as1.col1 = as3.col3
             full outer join as4 on as1.col1 = as4.col4
         )
    select col1, sum(count1) count1 from as5 group by col1
""").show()
Output:
+----+------+
|col1|count1|
+----+------+
| B| 1|
| D| 1|
| C| 3|
| A| 3|
+----+------+
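If you would rather stay in the DataFrame API, the same idea (count each column on its own, then merge the per-letter counts) can be sketched roughly like this; I have not tested it at scale:
from functools import reduce
from pyspark.sql.functions import col, sum as sum_

# one (letter, count) frame per column, then stack them and add the counts up
per_col_counts = [
    df.groupBy(col(c).alias("col")).count()
    for c in ["col1", "col2", "col3", "col4"]
]
merged = reduce(lambda a, b: a.unionByName(b), per_col_counts)
merged.groupBy("col").agg(sum_("count").alias("count")).show()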
Upvotes: 1