scalawhy
scalawhy

Reputation: 33

Apply 'rlike' on a regex column?

I'm looking to apply the 'rlike' function on a column, however instead of a standard regular expression string I want to be able to input a column (which is a regular expression string).

ie. $col1.rlike($col2) where $col2 is in a regular expression format within the dataframe

I have tried applying a UDF: def rLike = udf((s: String, col: Column) => col.rlike(s))

This keeps giving me the error:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported

Could anyone please shed some light on how I can fix this.

Upvotes: 3

Views: 1019

Answers (2)

Ayush Jain
Ayush Jain

Reputation: 43

The original question was answered in a comment: the UDF needs to be implemented with the signature (s: String, col: String): Boolean

For anyone else who stumbled upon this question -- without using a UDF, the only semi-clean way I could find to do this as of now is to use expr function.

df.where(expr("col1 RLIKE col2"))

I found one Jira issue(SPARK-11366) requesting functions such as rlike to accept other columns, but it seems to have been closed without updates.

Upvotes: 0

Leo C
Leo C

Reputation: 22449

Method rlike does not support regex-matching pattern stored in a Column. An alternative is to use regexp_replace as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("a123", "[a-z]\\d+"),
  ("b456", "[a-z]+")
).toDF("text", "pattern")

val matched = "Matched!"  // can be any value non-existent in column `text`

df.where(regexp_replace($"text", $"pattern", lit(matched)) === matched).show
// +----+--------+
// |text| pattern|
// +----+--------+
// |a123|[a-z]\d+|
// +----+--------+

In case you would like to implement a custom rlike as an UDF (which generally doesn't scale well compared with native Spark API functions), here's one way:

def rlike = udf( (text: String, pattern: String) => text match {
  case pattern.r() => true
  case _ => false
} )

// Applying the UDF
df.where(rlike($"text", $"pattern"))

Upvotes: 1

Related Questions