Reputation: 49
I have a table like the following:
| city | center | qty_out | qty_out % |
|------|--------|---------|-----------|
| A    | 1      | 10      | .286      |
| A    | 2      | 2       | .057      |
| A    | 3      | 23      | .657      |
| B    | 1      | 40      | .8        |
| B    | 2      | 10      | .2        |
The city-center pair is unique (the table's primary key).
If any center within a city has a qty_out % of less than 10% (.10), I want to drop that center and redistribute its % evenly among the city's other centers, so the result above would become (the arithmetic is worked through after the table):
| city | center | qty_out_% |
|------|--------|-----------|
| A    | 1      | .3145     |
| A    | 3      | .6855     |
| B    | 1      | .8        |
| B    | 2      | .2        |
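To make the redistribution explicit, this is the plain arithmetic implied by the two tables:

```python
# City A: center 2 (.057) falls below the .10 threshold, so its share
# is split evenly between the 2 surviving centers (1 and 3).
share = 0.057 / 2      # 0.0285 per surviving center
print(0.286 + share)   # 0.3145 -> center 1
print(0.657 + share)   # 0.6855 -> center 3
# City B has no center below .10, so its rows are unchanged.
```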
How can I go about this? I was thinking of partitioning with a window function, but I can't think of a window function that would do the redistribution:
```python
from pyspark.sql import Window
from pyspark.sql.functions import col

column_list = ["city", "center"]
w = Window.partitionBy([col(x) for x in column_list]).orderBy('qty_out_%')
```
Upvotes: 0
Views: 46
Reputation: 9308
I am not a statistician, so I cannot comment on the equation; however, if I translate your description literally into PySpark, it looks like this:
```python
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('city')
# Null for centers at or above the 10% threshold, the raw % otherwise
redist_cond = F.when(F.col('qty_out %') < 0.1, F.col('qty_out %'))
df = (df.withColumn('redist', F.sum(redist_cond).over(w) / (F.count('*').over(w) - F.count(redist_cond).over(w)))
      .fillna(0, subset=['redist'])         # cities with nothing to redistribute get 0
      .filter(F.col('qty_out %') >= 0.1)    # drop the below-threshold centers
      .withColumn('qty_out %', redist_cond.otherwise(F.col('qty_out %') + F.col('redist')))
      .drop('redist'))
```
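For reference, a minimal way to reproduce the input end to end (a sketch assuming an active SparkSession; the `spark` variable name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [('A', 1, 10, 0.286), ('A', 2, 2, 0.057), ('A', 3, 23, 0.657),
     ('B', 1, 40, 0.8), ('B', 2, 10, 0.2)],
    ['city', 'center', 'qty_out', 'qty_out %'])
```

Running the snippet above against this DataFrame and then calling `df.show()` leaves four rows with `qty_out %` values .3145, .6855, .8, and .2, matching your expected output.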
Upvotes: 1