Reputation: 81
regexp guru help needed!
I have a string that looks like this:
WordA1 wordA2 wordAN StartToken Skipword WordB1 WordB2 WordBN EndToken WordV1 WordCN
and I want to extract the substring WordB1 WordB2 WordBN, that is, everything between StartToken and EndToken except the first word inside. Usually I solve such problems with a pattern like:
(?<= StartToken )\S+\s\K.*?(?= EndToken )
The problem is that the system I'm implementing this on (Hive) does not support \K. A variable-width lookbehind is not possible either (Skipword does not have a fixed length, unfortunately), as in:
(?<= StartToken \S+\s).*?(?= EndToken )
Another solution is
(?<= StartToken )(\S+\s)(.*)?(?= EndToken )
and take group 2, but passing an exact group number would be very hard in my case and would require a lot of effort and code changes.
So my question is: does anybody have a simple, elegant solution that works on Hive and does not require passing a group number to regexp_extract?
Upvotes: 1
Views: 1782
Reputation: 1236
Do you want something like this?
(?=(?:\S+\s+){3}EndToken)(?:\S+\s+){2}\S+
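To make the behaviour of this fixed-length pattern concrete, here is a small sketch. It uses Python's `re` module only as a stand-in for Hive's Java regex engine, on the assumption that the pattern relies solely on constructs both engines share; the helper name `extract_last_three` is made up for illustration.

```python
import re

# Sample string from the question.
TEXT = ("WordA1 wordA2 wordAN StartToken Skipword "
        "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# The lookahead only succeeds at a position where exactly three
# whitespace-separated words stand before EndToken; the body then
# consumes those three words as the whole match (no capture group needed).
FIXED_THREE = re.compile(r"(?=(?:\S+\s+){3}EndToken)(?:\S+\s+){2}\S+")


def extract_last_three(text):
    """Return the three words right before EndToken, or None (hypothetical helper)."""
    match = FIXED_THREE.search(text)
    return match.group(0) if match else None


print(extract_last_three(TEXT))
```

Note that this pattern never looks at StartToken at all: it always returns the three words immediately before EndToken, so it only fits when the number of wanted words is known in advance.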
If the string you want to extract consists of a variable number of words, then you can try this regex:
(?<= )\b(?:(?!(?<=StartToken )\S+\s+).)+(?= EndToken)
(?<= )\b : a word-starting point (boundary)
(?= EndToken) : acts as the ending anchor in this regex
\b(?:(?!(?<=StartToken )\S+\s+).)+ : skips the word (\S+) that is preceded by "StartToken " and tries to match everything from each word-starting point (boundary) up to the ending anchor.
Upvotes: 2
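A quick way to check the variable-length pattern from this answer is to run it against the sample string and against a second string with a different word count. The sketch below again uses Python's `re` as a stand-in for Hive's Java engine (an assumption: only fixed-width lookbehind and lookahead are used, which both support); `extract_between` and the second test string are invented for illustration.

```python
import re

# Sample string from the question.
TEXT = ("WordA1 wordA2 wordAN StartToken Skipword "
        "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# Each character is matched only if the current position is NOT
# "right after 'StartToken ' with a word following", so a match can
# neither start at nor run across the word to be skipped.
VARIABLE_LEN = re.compile(
    r"(?<= )\b(?:(?!(?<=StartToken )\S+\s+).)+(?= EndToken)"
)


def extract_between(text):
    """Return the words between the skipped word and EndToken (hypothetical helper)."""
    match = VARIABLE_LEN.search(text)
    return match.group(0) if match else None


print(extract_between(TEXT))
print(extract_between("a StartToken skip x y z w v EndToken tail"))
```

Because the whole match is the wanted substring, the result is available as group 0, which is what the question asks for.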
Reputation: 163207
On this page I see that besides regexp_extract there is also regexp_replace.
You might try that instead: select the parts before and after the tokens using an alternation and replace them with an empty string:
(?:^.*StartToken \S+\s| EndToken.*$)
(?: : non-capturing group
^.*StartToken \S+\s : from the beginning of the string, match any character zero or more times, followed by StartToken, a space, one or more non-whitespace characters, and a whitespace character
| : or
 EndToken.*$ : match a space and EndToken, followed by any character zero or more times until the end of the string
) : end of the non-capturing group
Upvotes: 0
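The replace-based idea can be sketched as follows. `re.sub` in Python plays the role of Hive's regexp_replace here, which is an assumption about equivalent behaviour for this particular pattern rather than a tested Hive query; `strip_outside` is a made-up helper name. In an actual Hive string literal the backslashes may need to be doubled.

```python
import re

# Sample string from the question.
TEXT = ("WordA1 wordA2 wordAN StartToken Skipword "
        "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# First alternative: everything from the start through StartToken and
# the word after it. Second alternative: " EndToken" through the end.
OUTSIDE = re.compile(r"(?:^.*StartToken \S+\s| EndToken.*$)")


def strip_outside(text):
    """Delete everything outside the wanted span (hypothetical helper)."""
    return OUTSIDE.sub("", text)


print(strip_outside(TEXT))
```

One difference from the extract-based answers: when the tokens are missing, nothing is removed and the input comes back unchanged instead of an empty result, so the caller may want to handle that case.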