Reputation: 145
My Spark version is 3.0.
I have aggregated the data with groupBy(). I want to create a function with a threshold: if the volume of data for a category is < 200 (this would be the threshold), then that category, 'C' in this case, should be removed from the main table.
How would I do that in PySpark? I thought about creating a list and appending 'C' to it, but I am not sure how to do it =/
Image 2 is the expected output. Can someone help me?
Upvotes: 0
Views: 644
Reputation: 31540
Try groupBy() with an aggregate to get sum(), then collect_list() of all the values per category, and finally explode the array.
Example:
df.show()
#+----+----+
#|col1|col2|
#+----+----+
#| A| 250|
#| A| 250|
#| A| 50|
#| B| 250|
#| B| 250|
#| B| 50|
#| C| 5|
#| C| 5|
#| C| 10|
#+----+----+
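(Not in the original answer: a minimal snippet to build the sample df shown above, assuming an active SparkSession named spark.)
df = spark.createDataFrame(
    [("A", 250), ("A", 250), ("A", 50),
     ("B", 250), ("B", 250), ("B", 50),
     ("C", 5), ("C", 5), ("C", 10)],
    ["col1", "col2"],
)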
from pyspark.sql.functions import col, sum, collect_list, explode

# Sum col2 per category while keeping the raw values in an array,
# keep only the categories whose total exceeds the threshold,
# then explode the array back into one row per original value.
df.groupBy("col1") \
  .agg(sum(col("col2")).alias("count"),
       collect_list(col("col2")).alias("col2")) \
  .filter(col("count") > 200) \
  .select("col1", explode("col2").alias("col2")) \
  .show()
#+----+----+
#|col1|col2|
#+----+----+
#| B| 250|
#| B| 250|
#| B| 50|
#| A| 250|
#| A| 250|
#| A| 50|
#+----+----+
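As a variant (not from the original answer), the "list" idea from the question also works: collect the categories whose total falls below the threshold into a Python list, then filter them out of the main table with isin(). A minimal sketch, assuming the same df and a threshold of 200:
from pyspark.sql.functions import col, sum as sum_

# Collect the names of the categories whose total volume is below the threshold
small = [r["col1"] for r in
         df.groupBy("col1")
           .agg(sum_(col("col2")).alias("total"))
           .filter(col("total") < 200)
           .collect()]  # ['C'] for the sample data

# Keep every row of the main table whose category is not in that list
df.filter(~col("col1").isin(small)).show()
This leaves the original rows untouched, so no collect_list/explode round trip is needed; only the short list of category names is pulled to the driver.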
Upvotes: 1