user147271

Reputation: 145

PySpark - Remove Rows After Groupby?

My Spark version is 3.0.

I have aggregated the data with groupBy(). I want to create a function with a threshold such that, if the volume for a category is < 200 (the threshold), that category (here 'C') is removed from the main table.

How would I do that in PySpark? I thought about creating a list and appending 'C' to it, but I am not sure how to do it =/

Image 2 is the expected output. Can someone help me?

[Image 1: input table]

[Image 2: expected output]

Upvotes: 0

Views: 644

Answers (1)

notNull

Reputation: 31540

Try groupBy with an aggregation: compute sum() for the threshold check and collect_list() of all the values per category, filter out categories whose sum is not above the threshold, then explode the array back into rows.

Example:

df.show()
#+----+----+
#|col1|col2|
#+----+----+
#|   A| 250|
#|   A| 250|
#|   A|  50|
#|   B| 250|
#|   B| 250|
#|   B|  50|
#|   C|   5|
#|   C|   5|
#|   C|  10|
#+----+----+

from pyspark.sql.functions import *


df.groupBy("col1").\
    agg(sum(col("col2")).alias("count"), collect_list(col("col2")).alias("col2")).\
    filter(col("count") > 200).\
    select("col1", explode("col2").alias("col2")).\
    show()
#+----+----+
#|col1|col2|
#+----+----+
#|   B| 250|
#|   B| 250|
#|   B|  50|
#|   A| 250|
#|   A| 250|
#|   A|  50|
#+----+----+

Upvotes: 1
