Lokesh Kumar P
Lokesh Kumar P

Reputation: 369

Spark dataframe negate filter conditions

I am trying apply a negation of filter condition on the DataFrame as shown below.

!(`Ship Mode` LIKE '%Truck%')

Which is throwing an exception below

Exception in thread "main" java.lang.RuntimeException: [1.3] failure: identifier expected

(!(`Ship Mode` LIKE '%Truck%'))
  ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
    at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
    at Main.main(Main.java:73)

Where as the same kind of negative filter conditions are working fine in MySQL. Please find below

mysql> select count(*) from audit_log where !(operation like '%Log%' or operation like '%Proj%');
+----------+
| count(*) |
+----------+
|      129 |
+----------+
1 row in set (0.05 sec)

Can anyone please let me know if this is planned to be fixed in Spark DataFrames in future releases or should I raise a JIRA.

Upvotes: 3

Views: 9292

Answers (1)

zero323
zero323

Reputation: 330063

It looks like you're using plain SQLContext where ! is not supported:

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

val data = Seq(("a", 1, 3), ("b", 2, 6), ("c", -1, 2))

val df = sqlContext.createDataFrame(data).toDF("x1", "x2", "x3") 
df.registerTempTable("df")

sqlContext.sql("SELECT * FROM df WHERE ! (x2 >  2 OR x3 < 4)").show

// java.lang.RuntimeException: [1.25] failure: identifier expected
//
// SELECT * FROM df WHERE ! (x2 >  2 OR x3 < 4)
//                         ^

It can be easily replaced with NOT:

sqlContext.sql("SELECT * FROM df WHERE NOT (x2 >  2 OR x3 < 4)").show

// +---+---+---+
// | x1| x2| x3|
// +---+---+---+
// |  b|  2|  6|
// +---+---+---+

If you still want to use ! you should work with HiveContext:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

val df1 = hiveContext.createDataFrame(data).toDF("x1", "x2", "x3")
df1.registerTempTable("df")

hiveContext.sql("SELECT * FROM df WHERE ! (x2 >  2 OR x3 < 4)").show

// +---+---+---+
// | x1| x2| x3|
// +---+---+---+
// |  b|  2|  6|
// +---+---+---+

Upvotes: 5

Related Questions