slashline

Reputation: 169

RegExp extraction

Here's the input string:

loadMedia('mediacontainer1', 'http://www.something.com/videos/JohnsAwesomeVideo.flv', 'http://www.something.com/videos/JohnsAwesomeCaption.xml', '/videos/video-splash-image.gif')

With this RegExp: \'.+.xml\'

... we get this:

'mediacontainer1', 'http://www.something.com/videos/JohnsAwesomeVideo.flv', 'http://www.something.com/videos/JohnsAwesomeCaption.xml'

... but I want to extract only this:

http://www.something.com/videos/JohnsAwesomeCaption.xml

Any suggestions? I'm sure this has been asked before, but it's difficult to search for. I'll be happy to accept a solution.

Thanks!

Upvotes: 3

Views: 130

Answers (4)

Sam Holder

Reputation: 32936

In .NET, this regex works for me:

\'[\w:/.]+\.xml\'

breaking it down:

  • a ' character
  • followed by one or more characters that are each a word character, ':', '/' or '.' (this matches the URL part)
  • followed by '.xml' (without this, the pattern would also match the other URLs)
  • followed by another ' character

I tested it here
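
For anyone who wants to reproduce this outside .NET, here is a minimal sketch in Python (an assumption; the question does not name a language). The variable names are illustrative only. It shows that the pattern finds the caption URL, with the surrounding quotes still included in the match.

```python
import re

# Input string from the question.
text = ("loadMedia('mediacontainer1', "
        "'http://www.something.com/videos/JohnsAwesomeVideo.flv', "
        "'http://www.something.com/videos/JohnsAwesomeCaption.xml', "
        "'/videos/video-splash-image.gif')")

# A quote, then URL-like characters, then a literal ".xml", then a quote.
pattern = re.compile(r"'[\w:/.]+\.xml'")

match = pattern.search(text)
quoted = match.group(0) if match else None

# The matched text still carries its surrounding single quotes.
print(quoted)
```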

Edit: I missed that you don't want the quotes in the result. In that case, as has been pointed out, you need lookbehind and lookahead so that the quotes are required by the search but not included in the match. Again in .NET:

(?<=')[\w:/.]+\.xml(?=')

but I think the best solution is a combination of those offered already:

(?<=')[^']+\.xml(?=')

which seems the simplest to read, at least to me.
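
As a rough check of the lookaround version, here is a small Python sketch (Python is an assumption, chosen because its `re` module supports fixed-width lookbehind; the names are illustrative). The quotes are asserted on both sides but are not part of the returned match.

```python
import re

# Input string from the question.
text = ("loadMedia('mediacontainer1', "
        "'http://www.something.com/videos/JohnsAwesomeVideo.flv', "
        "'http://www.something.com/videos/JohnsAwesomeCaption.xml', "
        "'/videos/video-splash-image.gif')")

# (?<=') requires a quote just before the match, (?=') requires one just after;
# neither quote is consumed, so neither appears in the result.
lookaround = re.compile(r"(?<=')[^']+\.xml(?=')")

match = lookaround.search(text)
url = match.group(0) if match else None

print(url)
```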

Upvotes: 2

Thomas Hupkens

Reputation: 1590

If you want to get everything within quotes that starts with http:

(?<=')http:[^']+(?=')

If you only want those ending with .xml:

(?<=')http:[^']+\.xml(?=')
  • It doesn't select the quotation marks (as you asked)
  • It's fast!

Fair warning: it only works if the regex engine you're using can handle lookbehind
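
If lookbehind is not available, one common workaround (not part of the answer above, so treat it as a swapped-in technique) is to match the quotes normally and pull the URL out with a capturing group. A minimal sketch in Python, with illustrative names:

```python
import re

# Input string from the question.
text = ("loadMedia('mediacontainer1', "
        "'http://www.something.com/videos/JohnsAwesomeVideo.flv', "
        "'http://www.something.com/videos/JohnsAwesomeCaption.xml', "
        "'/videos/video-splash-image.gif')")

# The quotes are matched literally; the parentheses capture only the URL.
capture = re.compile(r"'(http:[^']+\.xml)'")

match = capture.search(text)
# Group 0 would include the quotes; group 1 is just the captured URL.
url = match.group(1) if match else None

print(url)
```

The trade-off is that the caller has to read group 1 rather than the whole match, but the pattern uses no lookaround at all.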

Upvotes: 2

Gabriele Petrioli

Reputation: 195992

This should work (tested in JavaScript, but I'm pretty sure it would work in most regex engines):

'[^']+?\.xml'

It applies these rules:

  • starts with '
  • is followed by one or more characters other than '
  • ends in .xml'

You can demo it at http://RegExr.com?2tp6q

Upvotes: 2

Paul Sanwald

Reputation: 11329

Knowing the language would be helpful. Basically, you are having a problem because the + quantifier is greedy, meaning it will match the largest part of the string that it can. You need to use a non-greedy quantifier, which will match as little as possible.

We will need to know the language you're in to know what the syntax for the non-greedy quantifier should be.

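Here is a Perl recipe. As a side note, a non-greedy quantifier alone is not enough on this input: the match still starts at the first quote, so .+? runs across the earlier arguments until it reaches .xml'. Instead of .+, you probably want [^']+ so the match cannot cross a quote, and \. so the dot is matched literally:

\'[^']+?\.xml\'

should work if your language supports Perl-like regexes.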
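
To compare how greedy, non-greedy, and quote-excluding patterns behave on the question's input, here is a small Python sketch (Python is an assumption, since the thread never names the asker's language; the variable names are illustrative). On this particular input the non-greedy form returns the same span as the greedy one, because matching still begins at the first quote; excluding quotes with [^'] is what narrows the match.

```python
import re

# Input string from the question.
text = ("loadMedia('mediacontainer1', "
        "'http://www.something.com/videos/JohnsAwesomeVideo.flv', "
        "'http://www.something.com/videos/JohnsAwesomeCaption.xml', "
        "'/videos/video-splash-image.gif')")

# The pattern from the question: greedy .+ between the quotes.
greedy = re.search(r"'.+.xml'", text).group(0)

# Same pattern with a non-greedy quantifier.
lazy = re.search(r"'.+?.xml'", text).group(0)

# Negated character class: the match cannot run past a quote.
bounded = re.search(r"'[^']+\.xml'", text).group(0)

print(greedy == lazy)  # both start at the first quote in the string
print(bounded)
```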

Upvotes: 2
