Reputation: 874
I have a Spark dataframe with a column of integers:
MYCOLUMN:
1
1
2
5
5
5
6
The goal is to get output equivalent to collections.Counter([1,1,2,5,5,5,6]). I can achieve the desired result by converting the column to an RDD, calling collect, and then passing the values to Counter, but this is rather slow for large dataframes.
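For reference, the slow approach presumably looks something like this (a sketch reconstructed from the description above; df is assumed to be the dataframe holding the column):
from collections import Counter

# collect every value to the driver, then count locally (slow for large data)
counts = Counter(df.select('MYCOLUMN').rdd.map(lambda row: row[0]).collect())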
Is there a better, dataframe-based approach that achieves the same result?
Upvotes: 1
Views: 4980
Reputation: 5389
Maybe groupby and count is close to what you need. Here is a solution that counts each value using the dataframe API; I'm not sure whether it will be faster than going through the RDD.
# toy example
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))
df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')
Output from df_count.show():
+--------+-----+
|MYCOLUMN|count|
+--------+-----+
| 1| 2|
| 2| 1|
| 5| 3|
| 6| 1|
+--------+-----+
Now you can turn this into a Counter-style dictionary via the underlying rdd:
dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())
This gives {1: 2, 2: 1, 5: 3, 6: 1}, the same mapping Counter would produce.
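If you'd rather skip the rdd step entirely, the same dict can be built from collected Rows (a minor variant, not part of the original answer; df_count is the aggregated dataframe above):
# collect() on the small aggregated result returns Row objects that support key access
counter_dict = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}
Either way, the collect only touches the aggregated counts, not the full column, which is why this scales better than counting on the driver.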
Upvotes: 4