Reputation: 71
Relatively new to the world of regular expressions so please bear with me. I am trying to use regexp_extract to pull out a certain subset of data within the string but it is returning me NULL for the regular expression I am currently trying to use.
The requirement is as follows :-
String --> test=1233]3212] --> Extract 1233
String --> test=123355 --> 123355
The idea is to extract all data after test= upto either the first ] or the end of string.
The regular expression I have as of now which I am attempting to use via Hive's regexp_extract is --> test=(.*?)]|.*
select regexp_extract('test=1233]','test=(.*?)]|.*',1);
--1233 (CORRECT)
select regexp_extract('test=1233]323]','test=(.*?)]|.*',1);
--1233 (CORRECT)
select regexp_extract('test=1233323abc','test=(.*?)]|.*',1);
--NULL (Should return 1233323abc)
Would appreciate your suggestions. I am halfway there at this point :(
Upvotes: 3
Views: 2581
Reputation: 627536
You may use
regexp_extract(col,'test=([^\\]]+)',1);
It seems it is more appropriate to use regexp_replace
here as you want to get the whole input upon no match.
You may use
regexp_replace(col, ".*?test=([0-9a-zA-Z]+).*", "$1")
See the regex demo.
The regex will match a whole string and it will remove all text but 1 or more letters and digits after test=
substring or the whole input will be returned.
If there can be any chars but ]
, use
regexp_replace(col, ".*?test=([^\\]]+).*", "$1")
See another regex demo.
Upvotes: 1