Reputation: 75
I have a dataframe where every cell is a letter from A, B, C, or D. It is a very large dataframe. What's the most efficient way to come up with a count of the number of times each letter appears across the entire dataframe?
A very small example:
col1 col2 col3 col4
A    C    A    C
B    D    C    A
I want a count that looks like:
col count
A   3
B   1
C   3
D   1
Does not have to be sorted.
For example, I was thinking of combining the columns into one and calling groupBy and count on that:
from pyspark.sql.functions import array, col, explode

one_column_df = df.withColumn("mycol", array("col1", "col2", "col3", "col4")).select(explode(col("mycol")))
one_column_df.groupBy("col").count().show()
Is there a more efficient way than having to call array and explode?
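For comparison, the same melt-then-count idea can be written with stack() so that no intermediate array column is built; this is just a sketch and I have not benchmarked it against the array/explode version:
# stack(4, ...) turns each input row into 4 single-letter rows (one per column)
melted = df.selectExpr("stack(4, col1, col2, col3, col4) as (col)")
melted.groupBy("col").count().show()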
Upvotes: 1
Views: 361
Reputation: 427
Hope this helps: count each column separately in its own CTE, full outer join the per-letter counts, coalesce the letter across the joins, add the four counts (treating missing ones as 0), and finally group by the letter to get the totals:
data = [
    ('A', 'C', 'A', 'C'),
    ('B', 'D', 'C', 'A'),
]
df = spark.createDataFrame(data, ["col1", "col2", "col3", "col4"])
df.createOrReplaceTempView("parquetFile")
spark.sql("""
    with as1 as (select count(col1) c1, col1 from parquetFile group by col1),
         as2 as (select count(col2) c2, col2 from parquetFile group by col2),
         as3 as (select count(col3) c3, col3 from parquetFile group by col3),
         as4 as (select count(col4) c4, col4 from parquetFile group by col4),
         as5 as (
             select coalesce(col1, col2, col3, col4) col1,
                    (case when c1 is null then 0 else c1 end
                     + case when c2 is null then 0 else c2 end
                     + case when c3 is null then 0 else c3 end
                     + case when c4 is null then 0 else c4 end) count1
             from as1
             full outer join as2 on as1.col1 = as2.col2
             full outer join as3 on as1.col1 = as3.col3
             full outer join as4 on as1.col1 = as4.col4
         )
    select col1, sum(count1) count1 from as5 group by col1
""").show()
Output:
+----+------+
|col1|count1|
+----+------+
| B| 1|
| D| 1|
| C| 3|
| A| 3|
+----+------+
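If you would rather stay in the DataFrame API, the same idea (count each column on its own, then merge the per-letter counts) can be sketched roughly like this; I have not tested it at scale:
from functools import reduce
from pyspark.sql.functions import col, sum as sum_

# one (letter, count) frame per column, then stack them and add the counts up
per_col_counts = [
    df.groupBy(col(c).alias("col")).count()
    for c in ["col1", "col2", "col3", "col4"]
]
merged = reduce(lambda a, b: a.unionByName(b), per_col_counts)
merged.groupBy("col").agg(sum_("count").alias("count")).show()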
Upvotes: 1