regex match href without passing closing tag

Question

I am trying to create a regex to parse document links (pdf, ppt, xls, doc) in an HTML page. The regex is non-greedy, but I am seeing the following issue:

An href to an HTML page appears before the href to the document on the same line.

In this case the regex matches from the start of the href for the HTML page to the end of the document file extension in the following href on the same line.

Here's the regex I am using:

/href="\/cms\/(.*?\.(pdf|ppt|xls|doc))(\?.*?)?"/i

Here's some sample HTML to parse:

Medical

Currently this matches from the first href to the last pdf. It seems I need to specify that the match must not pass a closing ">" in the expression, but I have not been able to figure out how.

Would appreciate any help ...

Gumbo · Accepted Answer

Since your attribute value is wrapped in double quotes, you can exclude double quotes from being matched inside the value:

/href="\/cms\/([^"]*?\.(pdf|ppt|xls|doc))(\?[^"]*?)?"/i

You can narrow the valid characters even more by using [^<"].
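
To see the difference between the two patterns, here is a minimal sketch in Python (the question does not say which language is in use, so the choice of Python is an assumption). The sample line, its paths and the helper name document_paths are hypothetical, since the original sample HTML is not shown above; it only reproduces the described situation of an HTML link followed by a document link on the same line.

```python
import re

# Hypothetical sample line (not the asker's original HTML): a link to an
# HTML page followed by a link to a document on the same line.
SAMPLE = (
    '<a href="/cms/benefits/index.html">Benefits</a> '
    '<a href="/cms/docs/medical.pdf?v=2">Medical</a>'
)

# Pattern from the question: "." also matches quotes and angle brackets,
# so the lazy group keeps growing past the first attribute value until
# it reaches a document extension.
ORIGINAL = re.compile(
    r'href="/cms/(.*?\.(pdf|ppt|xls|doc))(\?.*?)?"', re.IGNORECASE
)

# Pattern from the answer: [^"] cannot cross the closing quote, so each
# match stays inside a single href value.
FIXED = re.compile(
    r'href="/cms/([^"]*?\.(pdf|ppt|xls|doc))(\?[^"]*?)?"', re.IGNORECASE
)


def document_paths(pattern, html):
    """Return capture group 1 (the path after /cms/) for every match."""
    return [m.group(1) for m in pattern.finditer(html)]


original_paths = document_paths(ORIGINAL, SAMPLE)
fixed_paths = document_paths(FIXED, SAMPLE)

print(original_paths)  # one over-long match starting at the HTML link
print(fixed_paths)     # only the document path
```

With the original pattern the single match starts at the first href and spills across the tags; with the negated character class the first href is skipped, because no document extension occurs before its closing quote.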
