Reputation: 2783
I'm trying to fit a distribution to an entire column in PySpark using the pandas_udf decorator.
Spark splits the column into smaller chunks, so I can't get the fit to be based on the entire population (all the values of this column).
This is the code I'm using:
from pyspark.sql import Row
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np
import scipy.stats as st

l = [('a', 0), ('b', 0.1), ('c', 0.2), ('d', 0.3), ('e', 0.4), ('f', 0.5)]
rdd = sc.parallelize(l)
rdd2 = rdd.map(lambda x: Row(name=x[0], val=float(x[1])))
dataframe = sqlContext.createDataFrame(rdd2)

@pandas_udf('float')
def expon_cdf_udf(x):
    # Fit an exponential distribution to the values this UDF receives
    # and evaluate its CDF on the same values.
    loc, scale = st.expon.fit(x)
    return pd.Series(st.expon.cdf(x, loc=loc, scale=scale))

dataframe = dataframe.withColumn("CDF", expon_cdf_udf(dataframe['val']))
display(dataframe)
Results:
name    val    CDF
a       0      0.27438605
b       0.1    0.20088507
c       0.2    0.75132775
d       0.3    0.88602823
e       0.4    0
f       0.5    0.23020019
The results I'm getting are based on parts of the population rather than the entire vector: Spark fits a separate distribution to each batch of values (sometimes a single value), so the results are obviously wrong.
Is there a way to constrain Spark to run on the entire column? I'm aware that this is not scalable, but I don't have any other option in my case.
Upvotes: 0
Views: 1490
Reputation: 35249
TL;DR This is not the use case for which pandas_udf is designed.
Is there a way to constrain Spark to run on the entire column? I'm aware that this is not scalable, but I don't have any other option in my case.
Of course you can:

- toPandas and run your function on the result.
- coalesce(1) and run pandas_udf on the result.
- groupBy(lit(1)) and run your function on the result (see the sketch below).

But if you do any of these, you could just go with Pandas from the beginning. If you can, then do it, and don't waste time hacking Spark.
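For illustration only, here is a minimal sketch of the first and third options, assuming Spark 2.3+ (for GROUPED_MAP pandas_udf) and that the whole column comfortably fits on a single machine. The names expon_cdf_all, local_pdf and result are made up for this sketch; the DataFrame and column names are reused from the question.

import pandas as pd
import scipy.stats as st
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

# Option 1: collect to the driver and fit on the full population with plain pandas.
local_pdf = dataframe.toPandas()
loc, scale = st.expon.fit(local_pdf['val'])
local_pdf['CDF'] = st.expon.cdf(local_pdf['val'], loc=loc, scale=scale)

# Option 3: force everything into a single group so the grouped-map UDF
# sees the whole column in one call.
@pandas_udf("name string, val double, CDF double", PandasUDFType.GROUPED_MAP)
def expon_cdf_all(group_pdf):
    loc, scale = st.expon.fit(group_pdf['val'])
    group_pdf['CDF'] = st.expon.cdf(group_pdf['val'], loc=loc, scale=scale)
    return group_pdf

result = dataframe.groupBy(lit(1)).apply(expon_cdf_all)

On Spark 3.x the same idea is usually written as a plain Python function passed to groupBy(lit(1)).applyInPandas(func, schema=...) instead of the GROUPED_MAP decorator. Either way, all the data ends up in a single task, which is exactly why starting from pandas directly is usually the simpler choice.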
Upvotes: 1