Reputation: 1624
I have a column that I classify with the following code:
val originalSqlLikePatternMap = Map("item (%) is blacklisted%" -> "BLACK_LIST",
"%Testing%" -> "TESTING",
"%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)
val df = Seq(
"Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
"Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
"item (1) is blacklisted, #!@"
).toDF("raw_type")
val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
but it gives:
+---------------------------------------------------------+----------------------+
|raw_type |updatedType |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT|
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+----------------------+
But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to produce both values, "TESTING, TOO_LOW_PURCHASE_COUNT", like this:
+---------------------------------------------------------+--------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST, Unknown |
+---------------------------------------------------------+--------------------------------+
Can someone tell me what I am doing wrong here?
Upvotes: 0
Views: 1621
Reputation: 2855
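OK, a couple of things here.
Regarding find: to get your desired output you need to check each row against every regex, so find is not the right choice. Per the Scala documentation, find returns
the first value produced by the iterator satisfying a predicate, if any.
It therefore stops at the first pattern that matches and never reports the others.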
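To illustrate the difference, here is a minimal plain-Scala sketch (no Spark needed) that compares find with collecting every match. The object name FindVsCollect and the helpers firstLabel and allLabels are hypothetical names chosen for this example, and the patterns are assumed to be the question's patterns already converted to regex, with the trailing space after "low" removed.

```scala
object FindVsCollect {
  // Assumed input: the question's LIKE patterns after replacing % with .*
  // (and without the trailing space after "low").
  val patterns: Seq[(String, String)] = Seq(
    "item (.*) is blacklisted.*" -> "BLACK_LIST",
    ".*Testing.*" -> "TESTING",
    ".*purchase count .* is too low.*" -> "TOO_LOW_PURCHASE_COUNT"
  )

  // find stops at the first pattern that matches the whole string.
  def firstLabel(value: String): Option[String] =
    patterns.find { case (regex, _) => value.matches(regex) }.map(_._2)

  // collect keeps the label of every pattern that matches.
  def allLabels(value: String): Seq[String] =
    patterns.collect { case (regex, label) if value.matches(regex) => label }
}
```

For the first row of the question, firstLabel yields only TESTING, while allLabels yields both TESTING and TOO_LOW_PURCHASE_COUNT.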
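Take care with the regex: you've left a space after "low" in the last pattern, which is why it does not match the first row (that row ends right after "low"). You may also want to reconsider simply replacing % with .*, because any regex metacharacters in the pattern, for example the parentheses in the first one, are then interpreted as regex syntax rather than as literal text. The pattern with the stray space is
%purchase count % is too low %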
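As a rough sketch of a less fragile conversion, the literal parts of a LIKE pattern can be quoted with java.util.regex.Pattern.quote before the % wildcards are turned into .*. The object LikeToRegex and its methods naive and quoted are hypothetical helpers written for this illustration; the sketch assumes % is the only wildcard in use and does not handle _ or escape characters.

```scala
import java.util.regex.Pattern

object LikeToRegex {
  // The conversion used in the question: parentheses and other
  // metacharacters in the pattern become regex syntax.
  def naive(like: String): String = like.replaceAll("%", ".*")

  // Quote every literal segment so that only % acts as a wildcard.
  // The limit -1 keeps empty segments at the start and end of the pattern.
  def quoted(like: String): String =
    like.split("%", -1).map(segment => Pattern.quote(segment)).mkString(".*")
}
```

With quoted, "item (%) is blacklisted%" only matches text that really contains the parentheses, whereas the naive version also accepts "item 1 is blacklisted". Neither version repairs the stray space: "%purchase count % is too low %" still requires a space after "low", so the pattern itself has to be corrected.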
So, with the changes, your code will be something like,
// Assumes a SparkSession named spark is in scope, as in spark-shell
import org.apache.spark.sql.functions.udf
import spark.implicits._

val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")
// Convert each SQL LIKE pattern into a compiled Regex
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)
val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!@"
).toDF("raw_type")
// Check the value against every pattern and join the labels of all matches
val converter = (value: String) => {
  val res = javaPatternMap.map(v => {
    v._1.findFirstIn(value) match {
      case Some(_) => v._2
      case None => ""
    }
  })
    .filter(_.nonEmpty).mkString(", ")
  if (res.isEmpty) "Unknown" else res
}
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)
Output:
+---------------------------------------------------------+-------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+-------------------------------+
Hope this helps!
Upvotes: 2