Reputation: 1059
How can I use Spark SQL filter as a case insensitive filter?
For example:
dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));
It only returns rows where the 'vendor' column is exactly equal to 'fortinet', but I want rows where the 'vendor' column equals 'fortinet' or 'Fortinet' or 'foRtinet' or ...
Upvotes: 23
Views: 51438
Reputation: 6014
Another alternative which saves a couple of sets of parentheses:
import pyspark.sql.functions as f
df.filter(f.upper("vendor") == "FORTINET")
Upvotes: 0
Reputation: 3228
Try the lower / upper string functions (from org.apache.spark.sql.functions):
dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
or
dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET"))
Upvotes: 22
Reputation: 330163
You can either use a case-insensitive regex:
val df = sc.parallelize(Seq(
(1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")
df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
or simple equality with lower / upper:
import org.apache.spark.sql.functions.{lower, upper}
df.where(lower($"v") === "fortinet")
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
df.where(upper($"v") === "FORTINET")
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
For simple filters I would prefer rlike, although performance should be similar; for join conditions equality is a much better choice. See How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion? for details.
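To illustrate why equality matters for joins, here is a minimal sketch (the vendors lookup table and its columns are hypothetical, assuming the same df as above) that normalizes both join keys with lower so the join stays an equi-join:
import org.apache.spark.sql.functions.lower
// Hypothetical lookup table with differently-cased vendor names
val vendors = Seq(("FORTINET", "Security")).toDF("name", "category")
// Lower-casing both keys keeps this an equality join, which Spark can
// execute as a hash / sort-merge join instead of falling back to a
// much slower non-equi (e.g. LIKE/rlike-based) join condition.
df.join(vendors, lower(df("v")) === lower(vendors("name")))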
Upvotes: 36