Ema Nymton

Reputation: 81

Regex to skip first word between the tokens

Regex guru help needed!

I have a string that looks like this:

WordA1 wordA2 wordAN StartToken Skipword WordB1 WordB2 WordBN EndToken WordV1 WordCN

and I want to extract the substring WordB1 WordB2 WordBN, that is, everything between StartToken and EndToken except the first word inside. I usually solve such problems with a pattern like:

(?<= StartToken )\S+\s\K.*?(?= EndToken )

The problem is that the system I'm implementing this in (Hive) does not support "\K". It's also not possible to use a variable-width lookbehind (Skipword is not of fixed length, unfortunately), like:

(?<= StartToken \S+\s).*?(?= EndToken )

Another solution is

(?<= StartToken )(\S+\s)(.*)?(?= EndToken )

and take group 2, but in my case passing the exact group number through is hard and would require a lot of effort and code changes.
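For reference, here is a minimal sketch of that capture-group variant. It uses Python's `re` module as a stand-in for Hive's Java regex engine, which is an assumption; the names `SAMPLE`, `GROUP_PATTERN` and `extract_with_group` are made up for illustration.

```python
import re

# Sample line taken from the question.
SAMPLE = ("WordA1 wordA2 wordAN StartToken Skipword "
          "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# Group 1 swallows the word to skip; group 2 holds the wanted words.
GROUP_PATTERN = re.compile(r"(?<= StartToken )(\S+\s)(.*)?(?= EndToken )")


def extract_with_group(text):
    """Return capture group 2, or None when the pattern does not match."""
    match = GROUP_PATTERN.search(text)
    return match.group(2) if match else None


print(extract_with_group(SAMPLE))
```

In Hive this would presumably correspond to calling regexp_extract with 2 as the group index, which is exactly the part I would like to avoid.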

So my question is: does anybody have a simple, elegant solution that works in Hive and does not require passing a group number to regexp_extract?

Upvotes: 1

Views: 1782

Answers (2)

Thm Lee

Reputation: 1236

Do you want something like this?

(?=(?:\S+\s+){3}EndToken)(?:\S+\s+){2}\S+
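A quick way to sanity-check this pattern is sketched below, assuming Python's `re` behaves like Hive's Java regex engine for these constructs. The names `FIXED_COUNT` and `extract_three_words` are hypothetical. Note that the pattern never mentions StartToken: it simply takes the three words directly in front of EndToken.

```python
import re

# Sample line taken from the question.
SAMPLE = ("WordA1 wordA2 wordAN StartToken Skipword "
          "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# The lookahead finds a spot where exactly three words precede EndToken;
# the rest of the pattern then consumes those three words.
FIXED_COUNT = re.compile(r"(?=(?:\S+\s+){3}EndToken)(?:\S+\s+){2}\S+")


def extract_three_words(text):
    """Return the whole match (group 0), or None when nothing matches."""
    match = FIXED_COUNT.search(text)
    return match.group(0) if match else None


print(extract_three_words(SAMPLE))
```

Because the whole match is the result, no group number other than 0 is involved.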

If the substring you want to extract consists of a variable number of words, you can try this regex instead.

(?<= )\b(?:(?!(?<=StartToken )\S+\s+).)+(?= EndToken)

  • (?<= )\b : a word-start position (a word boundary preceded by a space)
  • (?= EndToken) : acts as the ending anchor of the match
  • \b(?:(?!(?<=StartToken )\S+\s+).)+ : refuses to step onto the word (\S+) that is preceded by "StartToken ", and from each word-start position tries to match everything up to the ending anchor.
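The behaviour described in the list above can be checked with a small sketch like the following. It again assumes Python's `re` as a stand-in for Hive's Java regex engine; `VARIABLE` and `extract_between` are illustrative names only.

```python
import re

# Sample line taken from the question.
SAMPLE = ("WordA1 wordA2 wordAN StartToken Skipword "
          "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# Tempered run: no character may be consumed at a position that sits right
# after "StartToken " and starts a word, so the skipped word blocks any
# match that begins at or before it.
VARIABLE = re.compile(r"(?<= )\b(?:(?!(?<=StartToken )\S+\s+).)+(?= EndToken)")


def extract_between(text):
    """Return the whole match (group 0), or None when nothing matches."""
    match = VARIABLE.search(text)
    return match.group(0) if match else None


print(extract_between(SAMPLE))
print(extract_between("A StartToken Skip B1 B2 B3 B4 EndToken C"))
```

The lookbehinds here are fixed-width, so the variable length of the skipped word is not a problem.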

Upvotes: 2

The fourth bird

Reputation: 163207

At this page I see that besides regexp_extract there is also regexp_replace.

You might try that instead: use an alternation to match the parts before and after the text you want, and replace them with an empty string:

(?:^.*StartToken \S+\s| EndToken.*$)

  • (?: Non capturing group
  • ^.*StartToken \S+\s From the beginning of the string, match any character zero or more times, followed by StartToken, a space, one or more non-whitespace characters and a whitespace character.
  • | Or
  • EndToken.*$) Match EndToken followed by any character zero or more times until the end of the string.
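As a rough illustration of the replace approach, here is a sketch in Python, where `re.sub` plays the role of regexp_replace. That the two engines agree on this pattern is an assumption; `STRIP_OUTSIDE` and `keep_inside` are hypothetical names.

```python
import re

# Sample line taken from the question.
SAMPLE = ("WordA1 wordA2 wordAN StartToken Skipword "
          "WordB1 WordB2 WordBN EndToken WordV1 WordCN")

# First alternative: everything up to and including the skipped word.
# Second alternative: " EndToken" and everything after it.
STRIP_OUTSIDE = re.compile(r"(?:^.*StartToken \S+\s| EndToken.*$)")


def keep_inside(text):
    """Delete both outer parts; what remains is the wanted substring."""
    return STRIP_OUTSIDE.sub("", text)


print(keep_inside(SAMPLE))
```

One thing to keep in mind: when neither token is present, nothing is replaced and the input comes back unchanged, so you may want to guard against that case. In a Hive query the backslashes may also need doubling, depending on how the pattern string is quoted.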

Upvotes: 0
