Reputation: 419
Pandas code
df = df.groupby('col1').filter(lambda g: ~g.col2.isnull().all())
This groups by col1 and removes the groups in which every element of col2 is null. I tried the following:
PySpark
df.groupBy("col1").filter(~df.col2.isNotNull().all())
Upvotes: 1
Views: 1382
Reputation: 42332
You can compute a non-null count of col2 over each group (a window partitioned by col1), and use filter to remove the rows where that count is 0:
# example dataframe
df.show()
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   1|null|
|   2|   1|
|   2|null|
|   3|   1|
+----+----+
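For reference, a minimal sketch of how the example DataFrame above could be built (assuming an active SparkSession; the variable name spark is my own):

from pyspark.sql import SparkSession

# hypothetical reconstruction of the example data shown above
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, None), (1, None), (2, 1), (2, None), (3, 1)],
    ['col1', 'col2']
)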
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'not_null',
    # F.count ignores nulls, so this is the number of non-null col2 values in each col1 group
    F.count('col2').over(Window.partitionBy('col1'))
).filter('not_null != 0').drop('not_null')
df2.show()
+----+----+
|col1|col2|
+----+----+
|   3|   1|
|   2|   1|
|   2|null|
+----+----+
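If you'd rather avoid the window function, an aggregate plus semi-join gives the same result. This is a sketch of an equivalent alternative, not part of the original answer; the names keep and df2_alt are my own:

# count non-null col2 values per col1 group, keep groups with at least one
keep = (
    df.groupBy('col1')
      .agg(F.count('col2').alias('not_null'))
      .filter('not_null > 0')
      .select('col1')
)
# left-semi join keeps only the rows of df whose col1 appears in keep
df2_alt = df.join(keep, on='col1', how='left_semi')

Both approaches drop group 1, where every col2 value is null; the window version does it in a single pass without a join.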
Upvotes: 1