alex3465

Reputation: 419

Spark: First group by a column, then remove a group if a specific column is all null

Pandas code

df = df.groupby('col1').filter(lambda g: ~g.col2.isnull().all())

First group by col1, then remove a group if all the elements in col2 are null. I tried the following:

Pyspark

df.groupBy("col1").filter(~df.col2.isNotNull().all())

This does not work: groupBy returns a GroupedData object, which has no filter method, and a Spark Column has no all method.

Upvotes: 1

Views: 1382

Answers (1)

mck

Reputation: 42332

You can compute a non-null count over each group (F.count skips nulls), and use filter to remove the rows where that count is 0:

# example dataframe
df.show()
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   1|null|
|   2|   1|
|   2|null|
|   3|   1|
+----+----+

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'not_null',
    # F.count skips nulls, so this is the number of non-null col2 values in each col1 group
    F.count('col2').over(Window.partitionBy('col1'))
).filter('not_null != 0').drop('not_null')

df2.show()
+----+----+
|col1|col2|
+----+----+
|   3|   1|
|   2|   1|
|   2|null|
+----+----+
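If you'd rather avoid a window function, a sketch of an equivalent approach (not from the original answer) is to aggregate the non-null count per group, keep the qualifying keys, and semi-join them back onto df:

# alternative sketch: aggregate per group, then keep only rows whose key qualifies
keys = (
    df.groupBy('col1')
    .agg(F.count('col2').alias('not_null'))  # F.count skips nulls
    .filter('not_null != 0')
    .select('col1')
)
df3 = df.join(keys, on='col1', how='leftsemi')

A left semi join keeps every row of df whose col1 appears in keys without adding any columns, so df3 should match df2 above (row order aside).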

Upvotes: 1
