user9429934
user9429934

Reputation: 71

Hive regexp_extract returning NULL

Relatively new to the world of regular expressions so please bear with me. I am trying to use regexp_extract to pull out a certain subset of data within the string but it is returning me NULL for the regular expression I am currently trying to use.

The requirement is as follows :-

String --> test=1233]3212] --> Extract 1233

String --> test=123355 --> 123355

The idea is to extract all data after test= upto either the first ] or the end of string.

The regular expression I have as of now which I am attempting to use via Hive's regexp_extract is --> test=(.*?)]|.*

select regexp_extract('test=1233]','test=(.*?)]|.*',1);
--1233 (CORRECT)

select regexp_extract('test=1233]323]','test=(.*?)]|.*',1);
--1233 (CORRECT)

select regexp_extract('test=1233323abc','test=(.*?)]|.*',1);
--NULL (Should return 1233323abc)

Would appreciate your suggestions. I am halfway there at this point :(

Upvotes: 3

Views: 2581

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627536

You may use

regexp_extract(col,'test=([^\\]]+)',1);

It seems it is more appropriate to use regexp_replace here as you want to get the whole input upon no match.

You may use

regexp_replace(col, ".*?test=([0-9a-zA-Z]+).*", "$1")

See the regex demo.

The regex will match a whole string and it will remove all text but 1 or more letters and digits after test= substring or the whole input will be returned.

If there can be any chars but ], use

regexp_replace(col, ".*?test=([^\\]]+).*", "$1")

See another regex demo.

Upvotes: 1

Related Questions