Extracting number after specific string in Spark dataframe column - Scala

Question

I have a dataframe dfwith below format

 |constraint                                     |constraint_status |constraint_msg                                                                                             
 +----------------------------------------------------------------------------------------------------------------+--------------------------------+
 |CompletenessConstraint                        |Success          |Value: 1.0 Notnull condition should be satisfied     
 |UniquenessConstraint                          |Success          |Value: 1.0 Uniqueness condition should be satisfied                            |
 |PatternMatchConstraint                        |Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                          |
 |MinimumConstraint                             |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000 
 |HistogramConstraint                           |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|

I want to get numeric value after Value: string and create a new column Value.

Expected output

 |constraint                                     |constraint_status |constraint_msg                                                       |Value                                        
 +----------------------------------------------------------------------------------------------------------------+--------------------------------+
 |CompletenessConstraint                        |Success          |Value: 1.0 Notnull condition should be satisfied                          |     1.0
 |UniquenessConstraint                          |Success          |Value: 1.0 Uniqueness condition should be satisfied                       |     1.0 
 |PatternMatchConstraint                        |Failure          |Expected type of column CHD_ACCOUNT_NUMBER to be StringType               |     null
 |MinimumConstraint                             |Success          |Value: 5.1210650000005 Minimum value should be greater than 10.000000     |     5.1210650000005 
 |HistogramConstraint                           |Failure          |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000| null

I tried below code:

      df = df.withColumn("Value",split(df("constraint_msg"), "Value\: (\d+)").getItem(0))

But getting error. Need help!

org.apache.spark.sql.AnalysisException: cannot resolve 'split(constraint_msg, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, 'constraint_msg' is of array type.;;

s.polam · Accepted Answer

Check below code.

scala> df.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+


scala> df
.withColumn("Value",regexp_extract($"constraint_msg","Value: (\d.\d+)",1))
.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |               |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|               |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+

Extracting number after specific string in Spark dataframe column - Scala

Answers (2)

Related Questions