Anastasia

Reputation: 874

Convert Pyspark dataframe column to dict without RDD conversion

I have a Spark dataframe where columns are integers:

MYCOLUMN:
1
1
2
5
5
5
6

The goal is to get the output equivalent to collections.Counter([1, 1, 2, 5, 5, 5, 6]). I can achieve the desired result by transforming the column to an RDD, calling collect, and then passing the result to Counter, but this is rather slow for large dataframes.
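
For reference, the approach I'm currently using looks roughly like this (a sketch of what I described above):

from collections import Counter

# convert the column to an RDD, collect it locally, then count
counter = Counter(df.select('MYCOLUMN').rdd.map(lambda r: r['MYCOLUMN']).collect())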

Is there a better approach that uses dataframes that can achieve the same result?

Upvotes: 1

Views: 4980

Answers (1)

titipata

Reputation: 5389

Maybe groupby and count is close to what you need. Here is my solution for counting each value using only the dataframe API. I'm not sure whether it will be faster than using an RDD.

# toy example
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))

df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')

Output from df_count.show()

+--------+-----+
|MYCOLUMN|count|
+--------+-----+
|       1|    2|
|       2|    1|
|       5|    3|
|       6|    1|
+--------+-----+

Now, you can turn this into a Counter-like dictionary using the rdd:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())

This will give the output {1: 2, 2: 1, 5: 3, 6: 1}.
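
If you want to avoid the RDD conversion entirely, as the title asks, note that after the groupby the result is small enough to collect directly through the dataframe API. A dict comprehension over the collected rows gives the same dictionary (a sketch using df_count from above):

counts = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}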

Upvotes: 4
