LUZO
LUZO

Reputation: 1029

How to extract number from string column?

My requirement is to retrieve the order number from the comment column which is in a column comment and always starts with R. The order number should be added as a new column to the table.

Input data:

code,id,mode,location,status,comment 
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received

Expected output:

AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856 
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463 

I have tried it in spark-sql, the query I am using is given below:

val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")

However, I'm getting the following error:

org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0

Upvotes: 6

Views: 15188

Answers (2)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

You can use a udf function as following

import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)

df.withColumn("newColumn", extractString($"comment")).show(false)

where the comment column is splitted with space and filtering the words that starts with R. head will take the first word that was filtered starting with R.

Updated

To ensure that the returned string is order number starting with R and rest of the strings are digits, you can add additional filter

import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)

You can edit the filter according to your need.

Upvotes: 3

philantrovert
philantrovert

Reputation: 10082

You can use regexp_extract which has the definition :

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case by using a valid regex

df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show

+-----+---+-------+---------+------+--------------------+-------+
| code| id|   mode| location|status|             comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad|     D|order got delayed...|  R1657|
|FY-YT|102|Airways|    Delhi|    ND|R7856 package dam...|  R7856|
|TY-OP|103|Airways|     Pune|     D|Order number R546...|  R5463|
+-----+---+-------+---------+------+--------------------+-------+

Upvotes: 12

Related Questions