Reputation: 2783
I'm trying to fit a distribution to an entire column in PySpark using the pandas_udf decorator.
Spark splits the column into smaller chunks, so I can't get the fit to be based on the entire population (all the values of this column).
This is the code I'm using:
from pyspark.sql import Row
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np
import scipy.stats as st

l = [('a', 0), ('b', 0.1), ('c', 0.2), ('d', 0.3), ('e', 0.4), ('f', 0.5)]
rdd = sc.parallelize(l)
rdd2 = rdd.map(lambda x: Row(name=x[0], val=float(x[1])))
dataframe = sqlContext.createDataFrame(rdd2)

@pandas_udf('float')
def expon_cdf_udf(x):
    # Fit an exponential distribution to the values this UDF receives
    # and evaluate its CDF on the same values.
    loc, scale = st.expon.fit(x)
    return pd.Series(st.expon.cdf(x, loc=loc, scale=scale))

dataframe = dataframe.withColumn("CDF", expon_cdf_udf(dataframe['val']))
display(dataframe)
Results:
name    val    CDF
a       0      0.27438605
b       0.1    0.20088507
c       0.2    0.75132775
d       0.3    0.88602823
e       0.4    0
f       0.5    0.23020019
The results I'm getting are based on parts of the population rather than the entire vector: Spark fits a separate distribution to each batch of values (sometimes a single value), so the results are obviously wrong.
Is there a way to constrain Spark to run on the entire column? I'm aware that this is not scalable, but I don't have any other option in my case.
Upvotes: 0
Views: 1490
Reputation: 35249
TL;DR This is not the use case for which pandas_udf is designed.
Is there a way to constrain Spark to run on the entire column? I'm aware that this is not scalable, but I don't have any other option in my case.
Of course you can:

- toPandas and run your function on the result.
- coalesce(1) and run pandas_udf on the result.
- groupBy(lit(1)) and run your function on the result (see the sketch below).

But if you do any of these, you could just go with Pandas from the beginning. If you can, then do it, and don't waste time hacking Spark.
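For illustration only, here is a minimal sketch of the first and third options, assuming Spark 2.3+ (for GROUPED_MAP pandas_udf) and that the whole column comfortably fits on a single machine. The names expon_cdf_all, local_pdf and result are made up for this sketch; the DataFrame and column names are reused from the question.

import pandas as pd
import scipy.stats as st
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

# Option 1: collect to the driver and fit on the full population with plain pandas.
local_pdf = dataframe.toPandas()
loc, scale = st.expon.fit(local_pdf['val'])
local_pdf['CDF'] = st.expon.cdf(local_pdf['val'], loc=loc, scale=scale)

# Option 3: force everything into a single group so the grouped-map UDF
# sees the whole column in one call.
@pandas_udf("name string, val double, CDF double", PandasUDFType.GROUPED_MAP)
def expon_cdf_all(group_pdf):
    loc, scale = st.expon.fit(group_pdf['val'])
    group_pdf['CDF'] = st.expon.cdf(group_pdf['val'], loc=loc, scale=scale)
    return group_pdf

result = dataframe.groupBy(lit(1)).apply(expon_cdf_all)

On Spark 3.x the same idea is usually written as a plain Python function passed to groupBy(lit(1)).applyInPandas(func, schema=...) instead of the GROUPED_MAP decorator. Either way, all the data ends up in a single task, which is exactly why starting from pandas directly is usually the simpler choice.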
Upvotes: 1