pfnuesel

Reputation: 15350

Remove empty strings from list in DataFrame column

I have a DataFrame column containing a list with some empty values:

df.select('foo').show(10)
+----------+
|foo       |
+----------+
|[,]       |
|[bar, baz]|
|[,bar]    |
+----------+

I would like to filter out all empty values, i.e. it should look like that:

+----------+
|foo       |
+----------+
|null      |
|[bar, baz]|
|[bar]     |
+----------+

I tried using array_remove, but I cannot exclude the empty string.

Upvotes: 3

Views: 1941

Answers (2)

Ambar Raghuvanshi

Reputation: 116

This produces the expected output using only the filter function:

from pyspark.sql.functions import expr, when, size, lit
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([[['', '']],
                       [['bar', 'baz']],
                       [['', 'bar']]],
                      schema=['foo']) \
    .withColumn('foo', when(size(expr("filter(foo, elem -> elem != '')")) == 0, lit(None))
                .otherwise(expr("filter(foo, elem -> elem != '')"))) \
    .show(truncate=False)

+----------+
|foo       |
+----------+
|null      |
|[bar, baz]|
|[bar]     |
+----------+

Upvotes: 4

SCouto

Reputation: 7926

You can do it with the filter function inside a SQL expression:

from pyspark.sql.functions import expr

df.withColumn("newColumn", expr("filter(foo, elem -> elem != '')")).show()

If you don't want to keep the original column, you can just reuse the same name:

df.withColumn("foo", expr("filter(foo, elem -> elem != '')")).show()

Result (keeping both columns):

+----------+----------+
|       foo| newColumn|
+----------+----------+
|      [, ]|        []|
|[bar, baz]|[bar, baz]|
|   [, bar]|     [bar]|
+----------+----------+

Upvotes: 6
