Arvinth
Arvinth

Reputation: 70

Extracting number after specific string in Spark dataframe column - Scala

I have a dataframe dfwith below format

 |constraint                                     |constraint_status |constraint_msg                                                                                             
 +----------------------------------------------------------------------------------------------------------------+--------------------------------+
 |CompletenessConstraint                        |Success          |Value: 1.0 Notnull condition should be satisfied     
 |UniquenessConstraint                          |Success          |Value: 1.0 Uniqueness condition should be satisfied                            |
 |PatternMatchConstraint                        |Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                          |
 |MinimumConstraint                             |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000 
 |HistogramConstraint                           |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|

I want to get numeric value after Value: string and create a new column Value.

Expected output

 |constraint                                     |constraint_status |constraint_msg                                                       |Value                                        
 +----------------------------------------------------------------------------------------------------------------+--------------------------------+
 |CompletenessConstraint                        |Success          |Value: 1.0 Notnull condition should be satisfied                          |     1.0
 |UniquenessConstraint                          |Success          |Value: 1.0 Uniqueness condition should be satisfied                       |     1.0 
 |PatternMatchConstraint                        |Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType               |     null
 |MinimumConstraint                             |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000     |     5.1210650000005 
 |HistogramConstraint                           |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000| null  

I tried below code:

      df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))

But getting error. Need help!

org.apache.spark.sql.AnalysisException: cannot resolve 'split(constraint_msg, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, 'constraint_msg' is of array type.;;

Upvotes: 1

Views: 1407

Answers (2)

s.polam
s.polam

Reputation: 10382

Check below code.

scala> df.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+


scala> df
.withColumn("Value",regexp_extract($"constraint_msg","Value: (\\d.\\d+)",1))
.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |               |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|               |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+

Upvotes: 1

Shantanu Kher
Shantanu Kher

Reputation: 1054

when..otherwise will help you to filter those records first which do not contain Value:. Assuming constraint_msg always starts with Value:, I am picking second element after split as desired value.

val df = sc.parallelize(Seq(("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"), ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType"))).toDF("constraint", "constraint_status", "constraint_msg")

val df1 = df.withColumn("Value",when(col("constraint_msg").contains("Value:"),split(df("constraint_msg"), " ").getItem(1)).otherwise(null))

df1.show()
+--------------------+-----------------+--------------------+-----+
|          constraint|constraint_status|      constraint_msg|Value|
+--------------------+-----------------+--------------------+-----+
|CompletenessConst...|          Success|Value: 1.0 Notnul...|  1.0|
|PatternMatchConst...|          Failure|Expected type of ...| null|
+--------------------+-----------------+--------------------+-----+

Upvotes: 2

Related Questions