tkumaran

Reputation: 111

Regex on spark dataframe column

I have a dataframe in which one column contains a query as its value. I am trying to extract the value between one or two parentheses in the first group using a regex.

My input to the regex is:

select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'

Output should be:

field1

The Spark code I used to extract the value is:

val target_query = "select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'"

val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"),"(?<=\\().*?(?=\\))",0))

But I'm getting

sum(field1

Please let me know how to extract only field1. I may get sum(field1) instead of nvl(sum(field1),0) as well. Is it possible to use the same regex to handle both cases?
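For context, here is a minimal sketch that reproduces this behaviour, assuming a SparkSession named spark and the query stored in a column named target_query:

import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

// Hypothetical single-row dataframe holding the query string.
val df = Seq(
  "select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'"
).toDF("target_query")

// The lookaround regex stops at the first ')' it sees,
// so this prints "sum(field1" rather than "field1".
df.withColumn("Extract_field",
  regexp_extract(df("target_query"), "(?<=\\().*?(?=\\))", 0)
).show(false)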

Upvotes: 3

Views: 8911

Answers (1)

Shaido

Reputation: 28322

Since the field1 value will always be enclosed in a sum, you can use the following regexp:

sum\\((.*?)\\)

This will match everything enclosed by a sum().

To get the correct match, the field you want (field1) must be inside the first sum(...) in the query, since regexp_extract only returns a capture group from the first match of the pattern. More generally, as long as the field always appears in the same position (first, second, etc.), you can adjust the pattern to match it. For example:

val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"), "sum\\((.*?)\\)", 1))

The 1 at the end means that everything in the first capture group is extracted.
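To confirm that the same pattern covers both query forms mentioned in the question (with and without the outer nvl(...)), here is a small sketch, assuming the queries sit in a column named target_query:

import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

val queries = Seq(
  "select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'",
  "select sum(field1), field2, field3 from tableName1 where partition_date='2018-03-13'"
).toDF("target_query")

// Both rows yield "field1": the pattern anchors on sum( and captures
// up to the first closing parenthesis, regardless of any outer nvl(...).
queries.withColumn("Extract_field",
  regexp_extract($"target_query", "sum\\((.*?)\\)", 1)
).show(false)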

Upvotes: 3
