charlie_boy

Reputation: 101

Filter Spark DataFrame with multiple conditions on multiple columns in PySpark

I would like to implement the SQL conditions below in PySpark:

SELECT *
FROM   table
WHERE  NOT (ID = 1 AND Event = 1)
  AND  NOT (ID = 2 AND Event = 2)
  AND  NOT (ID = 1 AND Event = 0)
  AND  NOT (ID = 2 AND Event = 0)

What would be the clean way to do this?
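For reference, here is a minimal DataFrame to test against (a sketch only: the ID and Event column names come from the SQL above, the rows are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows only; the ID/Event column names match the SQL above
df = spark.createDataFrame(
    [(1, 1), (1, 2), (2, 0), (2, 2), (3, 1)],
    ["ID", "Event"],
)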

Upvotes: 1

Views: 1938

Answers (2)

mck

Reputation: 42422

If you're lazy, you can just copy and paste the SQL filter expression into the PySpark filter:

df.filter("""
               NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 
""")

Upvotes: 1

Anand Vidvat

Reputation: 1058

You can use the filter or where function with the DataFrame API.

The equivalent code would be as follows:

df.filter(~((df.ID == 1) & (df.Event == 1)) & 
          ~((df.ID == 2) & (df.Event == 2)) & 
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))
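Since the question asks for a clean way, one common pattern (a sketch, not part of this answer) is to keep the excluded (ID, Event) pairs in a list and fold them into a single predicate with functools.reduce, so adding a new pair doesn't require another hand-written clause:

from functools import reduce
from pyspark.sql import functions as F

# (ID, Event) pairs to exclude, taken from the question's SQL
excluded = [(1, 1), (2, 2), (1, 0), (2, 0)]

# AND together NOT (ID == i AND Event == e) for every pair
condition = reduce(
    lambda acc, pair: acc & ~((F.col("ID") == pair[0]) & (F.col("Event") == pair[1])),
    excluded,
    F.lit(True),
)

df.filter(condition)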

Upvotes: 2
