sparkonhdfs

Reputation: 1343

Regular Expression to get surrounding text, but not matching words in between

I am trying to write a regular expression to extract URLs whose endpoints have the following format:

https://api.siteurl.com/id/a1b2c3d4/apps

https://api.siteurl.com/id/a1b2c3d4/devices

...

etc

The id in these URLs is a1b2c3d4. It can differ between URLs, and I want to extract only the text that surrounds it.

The following regular expression matches the entire string:

https:\/\/\S+\.\S+\.com\/id\/\S+\/\S+

However, I don't want to extract the id itself; I only want to use it as a lookahead.

The final extracted string should be like https://api.siteurl.com/id'...'apps'

Where the ... is not actually extracted.

Is it only possible to do this using 2 regexes, where each uses a look-ahead and a look-behind, or can a single expression be used to extract just the relevant parts of the url?

Upvotes: 0

Views: 863

Answers (1)

The fourth bird

Reputation: 163227

You could use 2 capturing groups to capture the data that you want to keep, and match the data that you don't want to keep.

(https:\/\/\S+\.\S+\.com\/id)\/[^\/]+\/(\S+)
  • ( Capture group 1
    • https:\/\/\S+\.\S+\.com\/id Match from the start of the URL up to and including id, without the trailing /
  • ) Close group
  • \/ Match the / following
  • [^\/]+\/ Match 1+ times any char except / (the id itself, which is not captured), then match /
  • (\S+) Capture group 2 Match 1+ times a non whitespace char
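As a rough illustration of how the two groups could be used, here is a minimal Python sketch. The question does not say which language or regex engine is in use, so Python's re module is an assumption, and the helper names extract_parts and drop_id are hypothetical. The forward slashes are left unescaped because Python does not require escaping them.

```python
import re

# Same pattern as above, with "/" unescaped (not needed in Python).
# Group 1 keeps the prefix up to "id", the id segment is matched but
# not captured, and group 2 keeps whatever follows the id.
URL_PATTERN = re.compile(r"(https://\S+\.\S+\.com/id)/[^/]+/(\S+)")


def extract_parts(url):
    """Return (prefix, endpoint) surrounding the id, or None if no match."""
    match = URL_PATTERN.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


def drop_id(url):
    """Rebuild the URL from the two groups, leaving the id segment out."""
    return URL_PATTERN.sub(r"\1/\2", url)


if __name__ == "__main__":
    print(extract_parts("https://api.siteurl.com/id/a1b2c3d4/apps"))
    print(drop_id("https://api.siteurl.com/id/a1b2c3d4/devices"))
```

A single match cannot skip text in the middle, so the id is always part of the overall match; the groups are what let you pick out only the surrounding parts.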

Regex demo

This is the pattern from the comment without the non-capturing group (?:...), as that group is unnecessary.

Upvotes: 1
