alexanoid

Reputation: 25812

Apache Spark startsWith in SQL expression

In the Apache Spark API I can use the startsWith function to test the value of a column:

myDataFrame.filter(col("columnName").startsWith("PREFIX"))

Is it possible to do the same in a Spark SQL expression, and if so, could you please show an example?

Upvotes: 6

Views: 12241

Answers (4)

dzcxzl

Reputation: 71

Spark 3.3.0 added the startswith function.

[SPARK-37520][SQL] Add the startswith() and endswith() string functions

https://issues.apache.org/jira/browse/SPARK-37520

https://spark.apache.org/docs/3.3.0/api/sql/index.html#startswith

Usage

> SELECT startswith('Spark SQL', 'Spark');
 true
> SELECT startswith('Spark SQL', 'SQL');
 false
> SELECT startswith('Spark SQL', null);
 NULL
> SELECT startswith(x'537061726b2053514c', x'537061726b');
 true
> SELECT startswith(x'537061726b2053514c', x'53514c');
 false
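
For completeness, here is how that fits the original filter use case. A minimal sketch, assuming Spark 3.3.0+ in a spark-shell session (the DataFrame and column name a are made up for illustration):

val df = Seq("Spark SQL", "Presto SQL").toDF("a")

// startswith can be used directly inside a SQL expression string:
df.filter("startswith(a, 'Spark')").show()

// or through a temp view and spark.sql:
df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE startswith(a, 'Spark')").show()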

Upvotes: 1

stack0114106

Reputation: 8711

With regexp_replace, you can get the same results without UDFs. Check this out:

scala> val df=Seq(("bPREFIX",1),("PREFIXb",2)).toDF("a","b")
df: org.apache.spark.sql.DataFrame = [a: string, b: int]

scala> df.show
+-------+---+
|      a|  b|
+-------+---+
|bPREFIX|  1|
|PREFIXb|  2|
+-------+---+


scala> df.filter(regexp_replace('a,"""^PREFIX.*""","")==="").show
+-------+---+
|      a|  b|
+-------+---+
|PREFIXb|  2|
+-------+---+



or using regexp_extract()

scala> df.filter(regexp_extract('a,"""(^PREFIX)""",1)==="PREFIX").show
+-------+---+
|      a|  b|
+-------+---+
|PREFIXb|  2|
+-------+---+



Using instr() function

scala> df.filter("instr(a,'PREFIX')=1").show
+-------+---+
|      a|  b|
+-------+---+
|PREFIXb|  2|
+-------+---+


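Since the question is about SQL expressions: the same predicates also work as plain SQL strings. A sketch, assuming the df above in a spark-shell session:

df.createOrReplaceTempView("t")

// instr-based prefix test as pure SQL:
spark.sql("SELECT * FROM t WHERE instr(a, 'PREFIX') = 1").show()

// regexp_extract-based variant (group 0 is the whole match):
spark.sql("SELECT * FROM t WHERE regexp_extract(a, '^PREFIX', 0) = 'PREFIX'").show()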

Upvotes: 7

alexanoid

Reputation: 25812

I found the following solution, which works in a Spark SQL query out of the box, without a custom UDF. For example:

CASE WHEN instr(PRICE_SOURCE_INDICATOR,'MAC') = 1
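
A sketch of the same instr() test as a complete expression (the result column name starts_with_mac is made up; myDataFrame is the DataFrame from the question, here assumed to contain a PRICE_SOURCE_INDICATOR column):

myDataFrame.selectExpr(
  "*",
  "CASE WHEN instr(PRICE_SOURCE_INDICATOR, 'MAC') = 1 THEN true ELSE false END AS starts_with_mac"
).show()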

Upvotes: 4

SCouto

Reputation: 7928

You can create a UDF to do that:

import org.apache.spark.sql.functions.{udf, lit}
val startsWith = udf((columnValue: String) => columnValue.startsWith("PREFIX"))

The UDF receives the column value and checks it against the prefix. Then you can use it as follows:

myDataFrame.filter(startsWith($"columnName"))

If you want to pass the prefix as a parameter, you can do so with lit:

val startsWith = udf((columnValue: String, prefix:String) => columnValue.startsWith(prefix))
myDataFrame.filter(startsWith($"columnName", lit("PREFIX")))

INPUT EXAMPLE

+-----------+------+
|letter     |number|
+-----------+------+
|    PREFIXb|     1|
|    bPREFIX|     2|
+-----------+------+

OUTPUT EXAMPLE

myDataFrame.filter(startsWith($"letter")).show
+-----------+------+
|letter     |number|
+-----------+------+
|    PREFIXb|     1|
+-----------+------+
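
Since the original question asks about SQL expressions: a registered UDF can be called from SQL strings too. A minimal sketch (the function name starts_with_prefix and view name t are made up):

// Registering the UDF makes it available inside SQL expressions:
spark.udf.register("starts_with_prefix",
  (columnValue: String, prefix: String) => columnValue.startsWith(prefix))

myDataFrame.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE starts_with_prefix(letter, 'PREFIX')").show()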

Upvotes: 1
