Reputation: 111
Have a dataframe, which has a query as a value in one of the column, I am trying to extract the value between one/two parentheses in the first group using regex.
My Input the regex is:
select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'
Output should be:
field1
Spark Code what I used to extract the value is:
target_query = "select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'"
val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"),"(?<=\\().*?(?=\\))",0))
But I'm getting
sum(field1
Please let me know how to extract only the field1
. I may get sum(field1)
instead of nvl(sum(field1),0)
as well. Is it possible to use same regex to handle both the cases?
Upvotes: 3
Views: 8911
Reputation: 28322
Since the field1
value always will be inclosed with a sum
you can use the following regexp:
sum\\((.*?)\\)
This will match everything enclosed by a sum()
.
To get the correct match it is necessary that the field you want (field1
) is the first sum in the query (since you need to specify the groupid of the match in regexp_extract
). In truth, as long as it is in the same position (first, second, etc.) you can correctly match it. For example:
val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"), "sum\\((.*?)\\)", 1))
The 1 at the end means that everything in the first capture group is extracted.
Upvotes: 3