Arman

Reputation: 1059

Spark SQL case insensitive filter for column conditions

How can I use a Spark SQL filter as a case-insensitive filter?

For example:

dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));

It only returns rows where the 'vendor' column is exactly equal to 'fortinet', but I want the rows where the 'vendor' column equals 'fortinet' or 'Fortinet' or 'foRtinet' or ...

Upvotes: 23

Views: 51438

Answers (3)

NonCreature0714

Reputation: 6014

Another alternative, which saves a couple of sets of parentheses:

import pyspark.sql.functions as f

df.filter(f.upper("vendor") == "FORTINET")

Upvotes: 0

Shawn Guo

Reputation: 3228

Try the lower/upper string functions (from org.apache.spark.sql.functions):

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))

or

dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET"))
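
Equivalently, the same condition can be written as a SQL expression string passed to filter, if that reads more cleanly (a minimal sketch using the question's column name):

dataFrame.filter("lower(vendor) = 'fortinet'")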

Upvotes: 22

zero323

Reputation: 330163

You can either use a case-insensitive regex (the (?i) inline flag makes the pattern case-insensitive, and ^...$ anchors it to the whole value):

val df = sc.parallelize(Seq(
  (1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")

df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

or simple equality with lower / upper:

import org.apache.spark.sql.functions.{lower, upper}

df.where(lower($"v") === "fortinet")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

df.where(upper($"v") === "FORTINET")
// +---+--------+
// |  k|       v|
// +---+--------+
// |  1|Fortinet|
// |  2|foRtinet|
// +---+--------+

For simple filters I would prefer rlike, although performance should be similar; for join conditions, equality is a much better choice (see the sketch below). See How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion? for details.
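
For example, a case-insensitive join could look like this (a minimal sketch, assuming a second DataFrame with a column v2 to match against):

import org.apache.spark.sql.functions.lower

val other = sc.parallelize(Seq(
  (1L, "FORTINET"), (2L, "bar")
)).toDF("id", "v2")

// lowering both sides keeps the condition a plain equality,
// so Spark can still plan it as an equi-join
df.join(other, lower($"v") === lower($"v2")).show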

Upvotes: 36
