Abhishek Choudhary

Reputation: 8385

Filter rows based on whether a column value is present in a list of strings

I have a DataFrame:

var input1 = spark.createDataFrame(Seq(
        (10L, "Joe Doe", 34),
        (11L, "Jane Doe", 31),
        (12L, "Alice Jones", 25)
        )).toDF("id", "name", "age")

I am trying to keep only the rows whose value is not in a given list. Filtering on age or id works fine:

input1.filter("age not in (31,56,81)").show()

But the same approach does not work when I filter on name:

input1.filter("name not in ("joe Doe","Pappu cam","Log")").show()

There must be a specific way to represent string values in the filter expression.

I am getting this exception:

org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'Doe' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 1, pos 16)
== SQL ==
name not in (Joe Doe,abc dej)
----------------^^^

Upvotes: 0

Views: 1466

Answers (2)

Yaron

Reputation: 10450

Wrap the SQL expression in a triple-quoted string so that the inner double quotes do not terminate the Scala string literal:

input1.filter(s"""name not in ("joe Doe","Pappu cam","Log")""").show()
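If the names come from a Scala collection rather than being typed by hand, the expression string can be generated. The following is only a sketch: `notInExpr` is a hypothetical helper (not part of Spark), it assumes a non-empty list, and it assumes Spark's default string-literal parsing, where a backslash escapes a quote inside a literal.

```scala
// Hypothetical helper: builds "<column> not in ('a', 'b', ...)" from a list of strings.
// Assumes `values` is non-empty; an empty list would yield "not in ()", which does not parse.
def notInExpr(column: String, values: Seq[String]): String = {
  // Wrap each value in single quotes, escaping backslashes and embedded single quotes.
  val literals = values.map { v =>
    "'" + v.replace("\\", "\\\\").replace("'", "\\'") + "'"
  }
  s"$column not in (${literals.mkString(", ")})"
}

val names = Seq("joe Doe", "Pappu cam", "Log")
val expr = notInExpr("name", names)
println(expr) // name not in ('joe Doe', 'Pappu cam', 'Log')
```

The resulting string can then be passed to the filter, as in `input1.filter(expr).show()`.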

Upvotes: 1

chresse

Reputation: 5805

This looks like a quoting problem. Use single quotes for the string literals inside the expression:

input1.filter("name not in ('joe Doe','Pappu cam','Log')").show()
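Since the question is about a list of strings, another option is to skip the SQL string and use the Column API, which avoids the quoting issue altogether. A minimal sketch, assuming a local SparkSession and the same sample data as in the question; the name `excluded` and the session settings are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("not-in-filter")
  .getOrCreate()

val input1 = spark.createDataFrame(Seq(
  (10L, "Joe Doe", 34),
  (11L, "Jane Doe", 31),
  (12L, "Alice Jones", 25)
)).toDF("id", "name", "age")

// The values to exclude, kept as an ordinary Scala list.
val excluded = Seq("joe Doe", "Pappu cam", "Log")

// isin takes varargs, so the list is expanded with `: _*`; `!` negates the condition.
input1.filter(!col("name").isin(excluded: _*)).show()
```

Note that the comparison is case-sensitive, so "joe Doe" in the list does not match "Joe Doe" in the data and all three rows are kept.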

Upvotes: 2
