Reputation: 70
I have a dataframe df
with below format
|constraint |constraint_status |constraint_msg
+----------------------------------------------------------------------------------------------------------------+--------------------------------+
|CompletenessConstraint |Success |Value: 1.0 Notnull condition should be satisfied
|UniquenessConstraint |Success |Value: 1.0 Uniqueness condition should be satisfied |
|PatternMatchConstraint |Failure |Expected type of column CHD_ACCOUNT_NUMBER to be StringType |
|MinimumConstraint |Success |Value: 5.1210650000005 Minimum value should be greater than 10.000000
|HistogramConstraint |Failure |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
I want to get numeric value after Value:
string and create a new column Value
.
Expected output
|constraint |constraint_status |constraint_msg |Value
+----------------------------------------------------------------------------------------------------------------+--------------------------------+
|CompletenessConstraint |Success |Value: 1.0 Notnull condition should be satisfied | 1.0
|UniquenessConstraint |Success |Value: 1.0 Uniqueness condition should be satisfied | 1.0
|PatternMatchConstraint |Failure |Expected type of column CHD_ACCOUNT_NUMBER to be StringType | null
|MinimumConstraint |Success |Value: 5.1210650000005 Minimum value should be greater than 10.000000 | 5.1210650000005
|HistogramConstraint |Failure |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000| null
I tried below code:
df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))
But getting error. Need help!
org.apache.spark.sql.AnalysisException: cannot resolve 'split(
constraint_msg
, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, 'constraint_msg
' is of array type.;;
Upvotes: 1
Views: 1407
Reputation: 10382
Check below code.
scala> df.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint |constraint_status |constraint_msg |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success |Value: 1.0 Notnull condition should be satisfied |
|UniquenessConstraint |Success |Value: 1.0 Uniqueness condition should be satisfied |
|PatternMatchConstraint|Failure |Expected type of column CHD_ACCOUNT_NUMBER to be StringType |
|MinimumConstraint |Success |Value: 5.1210650000005 Minimum value should be greater than 10.000000 |
|HistogramConstraint |Failure |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+
scala> df
.withColumn("Value",regexp_extract($"constraint_msg","Value: (\\d.\\d+)",1))
.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint |constraint_status |constraint_msg |Value |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success |Value: 1.0 Notnull condition should be satisfied |1.0 |
|UniquenessConstraint |Success |Value: 1.0 Uniqueness condition should be satisfied |1.0 |
|PatternMatchConstraint|Failure |Expected type of column CHD_ACCOUNT_NUMBER to be StringType | |
|MinimumConstraint |Success |Value: 5.1210650000005 Minimum value should be greater than 10.000000 |5.1210650000005|
|HistogramConstraint |Failure |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000| |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
Upvotes: 1
Reputation: 1054
when..otherwise
will help you to filter those records first which do not contain Value:
. Assuming constraint_msg always starts with Value:
, I am picking second element after split as desired value.
val df = sc.parallelize(Seq(("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"), ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType"))).toDF("constraint", "constraint_status", "constraint_msg")
val df1 = df.withColumn("Value",when(col("constraint_msg").contains("Value:"),split(df("constraint_msg"), " ").getItem(1)).otherwise(null))
df1.show()
+--------------------+-----------------+--------------------+-----+
| constraint|constraint_status| constraint_msg|Value|
+--------------------+-----------------+--------------------+-----+
|CompletenessConst...| Success|Value: 1.0 Notnul...| 1.0|
|PatternMatchConst...| Failure|Expected type of ...| null|
+--------------------+-----------------+--------------------+-----+
Upvotes: 2