john_ryan

Reputation: 1787

Regex matching for a string with a character limit, specific start characters, and termination

I'm trying to extract specific portions of a URL string. A simplified example is looking for any string in a URL that starts with "who" or "what", has a total length of either 5 or 10 characters, and stops matching at the first non-alphanumeric character.

for example:

http://www.test.com/who12/foo -> who12 //5 char match starting with who and ending at the /

http://www.test.com/who1234567/foo -> who1234567 //10 char match starting with who and ending at the /

http://www.test.com/what1 -> what1 //5 char match at the end of the string

http://www.test.com/what1?param=true -> what1 //5 char match breaking on the ?

I've tried setting something up here

It breaks on the / in the 5 and 10 char scenarios but fails on the ? case and the case where the match is at the end of the string.

Is there a simpler approach to accomplishing this?

Upvotes: 1

Views: 189

Answers (2)

Wiktor Stribiżew

Reputation: 626950

I suggest using

\.com\/\K(?:who[^\/?\s]{2}|what[^\/?\s])(?:[^\/?\s]{5})?

See this regex demo.

Use a capturing approach if the PCRE \K match reset operator is not supported:

\.com\/((?:who[^\/?\s]{2}|what[^\/?\s])(?:[^\/?\s]{5})?)

See this regex demo

Details:

  • \.com\/ - matches .com/ to establish the left-hand context for the text you need
  • (?:who[^\/?\s]{2}|what[^\/?\s])(?:[^\/?\s]{5})? - two alternatives, with 5 optional chars after either of them:
    • who[^\/?\s]{2} - who followed by 2 chars other than /, ? and whitespace
    • | - or
    • what[^\/?\s] - what followed by 1 char other than /, ? and whitespace, and then...
  • (?:[^\/?\s]{5})? - 5 optional chars other than /, ? and whitespace.
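
As a rough illustration, the capturing variant can be tried in Python, whose built-in re module does not support \K. The function name extract_token and the constant TOKEN_RE are hypothetical names chosen for this sketch; the URLs are the four examples from the question.

```python
import re

# Capturing-group variant from the answer above (no \K in Python's re module).
# Group 1 holds "who" + 2 chars or "what" + 1 char, plus an optional 5 more chars.
TOKEN_RE = re.compile(r"\.com\/((?:who[^\/?\s]{2}|what[^\/?\s])(?:[^\/?\s]{5})?)")


def extract_token(url):
    """Return the text captured after '.com/', or None when nothing matches."""
    match = TOKEN_RE.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    urls = [
        "http://www.test.com/who12/foo",
        "http://www.test.com/who1234567/foo",
        "http://www.test.com/what1",
        "http://www.test.com/what1?param=true",
    ]
    for url in urls:
        print(url, "->", extract_token(url))
```

For these four inputs the sketch prints who12, who1234567, what1 and what1. It relies on .com/ as the left-hand context, so a URL on a different domain would need that part of the pattern adjusted.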

Upvotes: 1

Rahul

Reputation: 2748

Try the following regex.

Regex: (?=.{5,10})(?:who|what)(?:[^\/?\s]*)

Explanation:

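  • (?=.{5,10}) is a lookahead that requires at least 5 characters (any characters except line breaks) to follow the current position. It does not cap the match at 10 characters, nor does it restrict the match to exactly 5 or 10 characters.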

  • (?:who|what) matches literals who or what.

  • [^\/?\s]* is a negated character class for /, ? and \s (whitespace), so it matches any run of characters other than these.

Regex101 Demo
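
To see how this pattern behaves, here is a small Python sketch. The helper name lookahead_match and the extra test URLs with 4-character and 6-character tokens are assumptions added for illustration; they are not from the question.

```python
import re

# Pattern from this answer, unchanged.
LOOKAHEAD_RE = re.compile(r"(?=.{5,10})(?:who|what)(?:[^\/?\s]*)")


def lookahead_match(url):
    """Return the first match of the lookahead pattern, or None."""
    match = LOOKAHEAD_RE.search(url)
    return match.group(0) if match else None


if __name__ == "__main__":
    urls = [
        "http://www.test.com/who12/foo",         # 5-char token from the question
        "http://www.test.com/what1?param=true",  # 5-char token, ends at ?
        "http://www.test.com/who123/foo",        # hypothetical 6-char token
        "http://www.test.com/who1",              # hypothetical 4-char token
    ]
    for url in urls:
        print(url, "->", lookahead_match(url))
```

The first two inputs yield who12 and what1, as the question expects. The third yields who123, because the lookahead only guarantees that at least 5 characters follow and the character class then consumes everything up to the next /, ? or whitespace. The fourth yields no match, since fewer than 5 characters remain at the point where who begins.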

Upvotes: 0
