Daniel Upton

Reputation: 5741

Regex to get a URL containing a keyword

Because redbubble.com has no API, I'm using its Atom feed to pull information about a user's pictures.

This is what the XML looks like:

<entry>
  <id>ID</id>
  <published>Date Published</published>
  <updated>Date Updated</updated>
  <link type="text/html" rel="alternate" href="http://www.redbubble.com/link/to/post"/>
  <title>Title</title>
  <content type="html">
    Blah blah blah stuff about the image..
    &lt;a href="http://www.redbubble.com/products/configure/config-id"&gt;&lt;img src="http://ih1.redbubble.net/path-to-image" alt="" /&gt;
  </content>
  <author>
  <name>Author Name</name>
  <uri>http://www.redbubble.com/people/author-user-name</uri>
  </author>
  <link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
  <category term="1"/>
  <category term="2"/>
</entry>

Using a regex, how would I go about getting the href attribute of the link inside the content tag?

One thing we know for sure is that it will always have configure in the path, i.e. http://somesite.com/**configure**/id

So I just need to find the URL that contains configure and grab the whole thing.

Upvotes: 1

Views: 2626

Answers (4)

Daniel Upton

Reputation: 5741

Thanks for your awesome answers, but my colleague solved it for me!

This is what I ended up using:

/http:\/\/([^"\/]*\/)*configure\/([^"]*)/

(Ruby regex by the way)
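
For anyone who wants to try it, here is a minimal sketch of how that pattern could be applied. The `content` string is just a hypothetical copy of the escaped content body from the question, and the constant name is made up for illustration.

```ruby
# Pattern from above: "http://", any number of path segments,
# then "configure/", then everything up to the closing quote.
CONFIGURE_URL = /http:\/\/([^"\/]*\/)*configure\/([^"]*)/

# Hypothetical sample: the escaped <content> body from the question.
content = 'Blah blah blah stuff about the image.. ' \
          '&lt;a href="http://www.redbubble.com/products/configure/config-id"&gt;' \
          '&lt;img src="http://ih1.redbubble.net/path-to-image" alt="" /&gt;'

match = CONFIGURE_URL.match(content)
url = match && match[0]        # the whole matched URL
config_id = match && match[2]  # the part after "configure/"

puts url
puts config_id
```

Because the pattern anchors on `http://` rather than on `href="`, it works on the escaped text as-is, but it would also pick up a matching URL that appears outside an href attribute.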

Upvotes: 1

stema

Reputation: 93006

If you have to use a regex, try this one:

href="(?=[^"]*configure)([^"]*)

rubular.com

The lookahead checks that the attribute value contains configure before the capture group grabs the whole value.
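
A rough sketch of how this might be used from Ruby, assuming the HTML has already been unescaped. The `html` string and the constant name are hypothetical, made up to show that a link without configure is skipped.

```ruby
# The lookahead (?=[^"]*configure) only succeeds if "configure" appears
# before the closing quote; group 1 then captures the attribute value.
HREF_WITH_CONFIGURE = /href="(?=[^"]*configure)([^"]*)/

# Hypothetical input: two links, only the second contains "configure".
html = '<a href="http://www.redbubble.com/people/author-user-name">author</a> ' \
       '<a href="http://www.redbubble.com/products/configure/config-id">buy</a>'

# String#[] with a capture index returns group 1, or nil if nothing matches.
configure_url = html[HREF_WITH_CONFIGURE, 1]
puts configure_url
```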

Upvotes: 1

Matt Ball

Reputation: 359906

Whatever programming language you're using, don't try to parse the whole thing with a regex. Use an XML parser first to extract the href="...". Then, sure, use a regex to make sure the URL contains configure.

As @KARASZI commented, XPath is another good approach.
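
Since the question turned out to be about Ruby, here is one way the parser-first approach could look, sketched with REXML. The XML is a trimmed, hypothetical copy of the entry from the question with no namespace declaration; a real Atom feed declares a default namespace, so the element lookup may need adjusting. Note that the content body is itself an escaped HTML fragment, so after the parser unescapes it a small regex is still used here to pull out the href values.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down version of the feed entry from the question.
xml = <<~XML
  <entry>
    <title>Title</title>
    <content type="html">
      Blah blah blah stuff about the image..
      &lt;a href="http://www.redbubble.com/products/configure/config-id"&gt;&lt;img src="http://ih1.redbubble.net/path-to-image" alt="" /&gt;
    </content>
  </entry>
XML

doc = REXML::Document.new(xml)

# Element#text returns the content with &lt; and &gt; already unescaped.
content_html = doc.root.elements['content'].text

# Collect every href value in the HTML fragment...
hrefs = content_html.scan(/href="([^"]*)"/).flatten

# ...then keep the first one that contains "configure".
configure_url = hrefs.find { |href| href =~ /configure/ }
puts configure_url
```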

Upvotes: 1

Leons

Reputation: 2674

The following regex will extract the href content based on your requirements. It seems to work for the sample code.

href="(\w[^"]+/configure/\w[^"]+)
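
In Ruby, the pattern could be written with `%r{...}` so the forward slashes need no escaping. This is only a sketch: the sample strings and the constant name are hypothetical. Because the pattern requires `/configure/` as a full path segment, a URL that merely contains the word elsewhere is not matched.

```ruby
# %r{...} avoids having to escape the forward slashes in the pattern.
HREF_CONFIGURE_SEGMENT = %r{href="(\w[^"]+/configure/\w[^"]+)}

# Hypothetical inputs for illustration.
good  = '<a href="http://www.redbubble.com/products/configure/config-id">'
other = '<a href="http://www.redbubble.com/reconfigure-me">'

matched   = good[HREF_CONFIGURE_SEGMENT, 1]   # group 1: the URL
unmatched = other[HREF_CONFIGURE_SEGMENT, 1]  # nil: no "/configure/" segment

puts matched
puts unmatched.inspect
```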

Upvotes: 2
