Reputation: 15350
I have a DataFrame with an array column containing some empty values:
df.select('foo').show(10)
+----------+
|foo |
+----------+
|[,] |
|[bar, baz]|
|[,bar] |
+----------+
I would like to filter out all the empty values, i.e. the result should look like this:
+----------+
|foo |
+----------+
|null |
|[bar, baz]|
|[bar] |
+----------+
I tried using array_remove, but I cannot exclude the empty values with it.
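Roughly what I tried (my guess at the exact call): if the blank entries are actually nulls, which show() renders the same way as empty strings, array_remove cannot drop them, since it only removes elements that compare equal to the given value, and null = '' evaluates to null rather than true:
from pyspark.sql.functions import array_remove

# array_remove drops elements equal to '' but silently keeps nulls,
# because null = '' evaluates to null, not true.
df.withColumn('foo', array_remove('foo', '')).show(10)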
Upvotes: 3
Views: 1941
Reputation: 116
This produces the expected output using the filter function, combined with when/size to turn arrays that end up empty into null:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, lit, size, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([[[None, None]],
                            [['bar', 'bar']],
                            [[None, 'bar']]],
                           schema=['foo'])

# filter (Spark 2.4+) keeps only the elements for which the predicate is
# true; for null elements the comparison yields null, so they are dropped
# as well. when/size then replaces arrays that end up empty with null.
df.withColumn('foo', when(size(expr("filter(foo, elem -> elem != '')")) == 0, lit(None))
              .otherwise(expr("filter(foo, elem -> elem != '')"))) \
  .show(truncate=False)
+----------+
|foo |
+----------+
|null |
|[bar, bar]|
|[bar] |
+----------+
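As a side note, the filter expression appears twice above; a slightly tidier sketch of the same logic (assuming the df built above) builds the Column once and reuses it in both branches:
from pyspark.sql.functions import expr, lit, size, when

# Build the filtered Column once and reuse it in both branches.
filtered = expr("filter(foo, elem -> elem != '')")
df.withColumn('foo', when(size(filtered) == 0, lit(None)).otherwise(filtered)) \
  .show(truncate=False)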
Upvotes: 4
Reputation: 7926
You can do it with the filter higher-order function (available since Spark 2.4):
df.withColumn("newColumn", expr("filter(foo, elem -> elem != '')")).show
If you don't want to keep the original column, you can just reuse the same name:
df.withColumn("foo", expr("filter(foo, elem -> elem != '')")).show()
Result (keeping both columns):
+----------+----------+
| foo| foonew|
+----------+----------+
| [, ]| []|
|[bar, baz]|[bar, baz]|
| [, bar]| [bar]|
+----------+----------+
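Note that this keeps an empty array for the first row rather than the null shown in the question; if you want null there as well, one sketch (assuming the same df) wraps the filter in a CASE expression:
from pyspark.sql.functions import expr

# Same filter, but CASE turns an all-empty result into null,
# matching the output asked for in the question.
df.withColumn("foo", expr(
    "CASE WHEN size(filter(foo, elem -> elem != '')) = 0 THEN NULL "
    "ELSE filter(foo, elem -> elem != '') END"
)).show()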
Upvotes: 6