Grshh

Reputation: 49

PySpark - redistribute percentages

I have a table like the following:

city | center | qty_out | qty_out %
----------------------------------------
 A   | 1      |  10     | .286
 A   | 2      |  2      | .057
 A   | 3      |  23     | .657
 B   | 1      |  40     | .8
 B   | 2      |  10     | .2

The (city, center) pair is unique and is the primary key.

If any center within a city has a qty_out % of less than 10% (.10), I want to ignore it and redistribute its % among the other centers of the city. So the result above would become

city | center | qty_out_%
----------------------------------------
 A   | 1      | .3145
 A   | 3      | .6855
 B   | 1      | .8
 B   | 2      | .2
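
(For city A, center 2 falls below the 10% threshold, so its .057 is split evenly between the two remaining centers: .286 + .057/2 = .3145 and .657 + .057/2 = .6855. City B has no center below 10%, so it is unchanged.)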

How can I go about this? I was thinking of partitioning with a window function, but I can't think of a window function to use for this:

from pyspark.sql.functions import col
from pyspark.sql.window import Window

column_list = ["city", "center"]
w = Window.partitionBy([col(x) for x in column_list]).orderBy('qty_out_%')
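
For reference, here is the sample data as a DataFrame (the column names, including 'qty_out %' with a space, are my assumption based on the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# sample data copied from the table in the question
df = spark.createDataFrame(
    [('A', 1, 10, .286), ('A', 2, 2, .057), ('A', 3, 23, .657),
     ('B', 1, 40, .8), ('B', 2, 10, .2)],
    ['city', 'center', 'qty_out', 'qty_out %'])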

Upvotes: 0

Views: 46

Answers (1)

Emma

Reputation: 9308

I am not a statistician, so I cannot comment on the equation; however, if I write the PySpark code as literally as you described it, it will look like this.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('city')
# the row's qty_out % when it is below the 10% threshold, otherwise null
redist_cond = F.when(F.col('qty_out %') < 0.1, F.col('qty_out %'))
# redist = total of the small shares / number of centers that are kept
df = (df.withColumn('redist', F.sum(redist_cond).over(w) / (F.count('*').over(w) - F.count(redist_cond).over(w)))
      .fillna(0, subset=['redist'])        # cities with nothing to redistribute get 0
      .filter(F.col('qty_out %') >= 0.1)   # drop the centers below the threshold
      # after the filter redist_cond is always null, so the otherwise branch adds the share
      .withColumn('qty_out %', redist_cond.otherwise(F.col('qty_out %') + F.col('redist')))
      .drop('redist'))
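
Applied to the sample data from your question, this keeps centers 1 and 3 for city A with shares of roughly .3145 and .6855 (up to floating-point rounding) and leaves city B untouched. A quick sanity check, assuming df already holds the result of the snippet above:

df.select('city', 'center', F.round('qty_out %', 4).alias('qty_out %')).show()
# expected (row order may vary):
# +----+------+---------+
# |city|center|qty_out %|
# +----+------+---------+
# |   A|     1|   0.3145|
# |   A|     3|   0.6855|
# |   B|     1|      0.8|
# |   B|     2|      0.2|
# +----+------+---------+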

Upvotes: 1
