Maikel Penz

Reputation: 41

Pyspark - dynamic where clause in Data Frame

Is it possible to perform a dynamic "where/filter" in a DataFrame? I am running a "like" operation to remove items that match specific strings:

eventsDF.where(
    ~eventsDF.myColumn.like('FirstString%') &
    ~eventsDF.myColumn.like('anotherString%')
).count()

However, I need to filter based on strings that come from another dataframe/list.
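For example (an assumption about where the strings live), if they sit in a column of another dataframe they could be collected into a Python list first:

# hypothetical patternsDF with a column "pattern" holding the prefixes to exclude
strings = [row.pattern for row in patternsDF.select("pattern").collect()]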

The solution that I was going for (which doesn't really work) involves a function that receives an index:

#my_func[0] = "FirstString"
#my_func[1] = "anotherString"

def my_func(n):
   return str(item[n])

newDf.where(
    ~newDf.useragent.like(str(my_func(1)) + '%')
).count()

but I'm struggling to make it work by passing a range (mainly because it's a list instead of an integer):

newDf.where(
    ~newDf.useragent.like(str(my_func([i for i in range(2)]) + '%'))
).count()

I don't want to go down the path of using "exec" or "eval" to do this.

Upvotes: 2

Views: 1043

Answers (1)

numeral

Reputation: 544

Build the conditions as str_likes = [~df.column.like(s) for s in strings], then reduce them into one expression with reduce(lambda x, y: x & y, str_likes).
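A rough sketch of that approach, reusing the question's eventsDF and myColumn names and appending the '%' wildcard as in the original filter:

from functools import reduce  # on Python 3, reduce lives in functools

strings = ["FirstString", "anotherString"]  # prefixes to exclude

# one negated LIKE per prefix, then AND them into a single column expression
str_likes = [~eventsDF.myColumn.like(s + '%') for s in strings]
eventsDF.where(reduce(lambda x, y: x & y, str_likes)).count()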

It's a little bit ugly, but it does what you want. You can also do this in a for loop, like so:

bool_expr = ~df.column.like(strings[0])
for s in strings[1:]:
    bool_expr &= ~df.column.like(s)
df.where(bool_expr).count()

Upvotes: 3
