Anastasia

Reputation: 874

Convert Pyspark dataframe column to dict without RDD conversion

I have a Spark dataframe where columns are integers:

MYCOLUMN:
1
1
2
5
5
5
6

The goal is to get the output equivalent to collections.Counter([1, 1, 2, 5, 5, 5, 6]). I can achieve the desired result by transforming the column to an RDD, calling collect, and then passing the result to Counter, but this is rather slow for large dataframes.
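
For reference, the approach I'm currently using looks roughly like this (a sketch of what I described above):

from collections import Counter

# convert the column to an RDD, collect it locally, then count
counter = Counter(df.select('MYCOLUMN').rdd.map(lambda r: r['MYCOLUMN']).collect())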

Is there a better approach that uses dataframes that can achieve the same result?

Upvotes: 1

Views: 4980

Answers (1)

titipata

Reputation: 5389

Maybe groupby and count is close to what you need. Here is my solution for counting each value using only the dataframe API. I'm not sure whether it will be faster than using an RDD.

# toy example
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))

df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')

Output from df_count.show()

+--------+-----+
|MYCOLUMN|count|
+--------+-----+
|       1|    2|
|       2|    1|
|       5|    3|
|       6|    1|
+--------+-----+

Now, you can turn this into a Counter-like dictionary using the rdd:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())

This will give the output {1: 2, 2: 1, 5: 3, 6: 1}.
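
If you want to avoid the RDD conversion entirely, as the title asks, note that after the groupby the result is small enough to collect directly through the dataframe API. A dict comprehension over the collected rows gives the same dictionary (a sketch using df_count from above):

counts = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}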

Upvotes: 4
