user3407267
user3407267

Reputation: 1624

How do I match multiple regex patterns against a column value in Spark?

I have column :

val originalSqlLikePatternMap = Map("item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low", 
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
   "item (1) is blacklisted, #!@" 
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)

val result = df.withColumn("updatedType", converterUDF($"raw_type"))

but it gives :

+---------------------------------------------------------+----------------------+
|raw_type                                                 |updatedType           |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING               |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT|
|#!@                                                      |Unknown               |
|item (mejwnw) is blacklisted                             |BLACK_LIST            |
|item (1) is blacklisted, #!@                             |BLACK_LIST            |
+---------------------------------------------------------+----------------------+

But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give 2 values "TESTING, TOO_LOW_PURCHASE_COUNT" like this :

 +---------------------------------------------------------+--------------------------------+
|raw_type                                                 |updatedType                     |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT          |
|#!@                                                      |Unknown                         |
|item (mejwnw) is blacklisted                             |BLACK_LIST                      |
|item (1) is blacklisted, #!@                             |BLACK_LIST, Unkown              |
+---------------------------------------------------------+--------------------------------+

Can someone tell what I am doing wrong here ?

Upvotes: 0

Views: 1621

Answers (1)

Chitral Verma
Chitral Verma

Reputation: 2855

Ok. So, couple of things here,

  1. Regarding find, you need to check each Row against each regex for your desired output, so find is not the right choice.

    the first value produced by the iterator satisfying a predicate, if any.

  2. Take care with regex, you've left a space after low, thats why its not matching. May you should reconsider just replacing % with .* also,

    %purchase count % is too low %

So, with the changes, your code will be something like,

 val originalSqlLikePatternMap = Map(
      "item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
      "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")

    val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)

    val df = Seq(
      "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
      "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
      "item (1) is blacklisted, #!@"
    ).toDF("raw_type")

    val converter = (value: String) => {
      val res = javaPatternMap.map(v => {
        v._1.findFirstIn(value) match {
          case Some(_) => v._2
          case None => ""
        }
      })
        .filter(_.nonEmpty).mkString(", ")

      if (res.isEmpty) "Unknown" else res
    }

    val converterUDF = udf(converter)

    val result = df.withColumn("updatedType", converterUDF($"raw_type"))

    result.show(false)

Output,

+---------------------------------------------------------+-------------------------------+
|raw_type                                                 |updatedType                    |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT         |
|#!@                                                      |Unknown                        |
|item (mejwnw) is blacklisted                             |BLACK_LIST                     |
|item (1) is blacklisted, #!@                             |BLACK_LIST                     |
+---------------------------------------------------------+-------------------------------+

Hope this helps!

Upvotes: 2

Related Questions