Reputation: 726
I am trying to create a regex to parse document links (pdf, ppt, xls, doc) in an HTML page. I have made the regex non-greedy, but I am seeing the following issue:
In this case the regex matches from the start of the href for the HTML page to the end of the document file extension in the following href on the same line.
Here's the regex I am using:
/href="\/cms\/(.*?\.(pdf|ppt|xls|doc))(\?.*?)?"/i
Here's some sample HTML to parse:
<a href="/cms/medical/plans_overview.html">Medical</a></div><a href="/cms/docs/mydoc.pdf">
Currently this matches from the first href to the last pdf. It seems I need to be able to specify that the match must not pass a closing ">" in the expression, but I have not been able to figure this out.
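For reference, the behavior can be reproduced with a minimal sketch. Python is an assumption here, since the question does not say which language runs the regex, and the variable names are illustrative only. The lazy `.*?` still starts at the first href and keeps expanding, across the closing quote and the tags, until a document extension finally appears.

```python
import re

# Pattern from the question in Python syntax (forward slashes need no escaping here).
pattern = re.compile(r'href="/cms/(.*?\.(pdf|ppt|xls|doc))(\?.*?)?"', re.IGNORECASE)

# Sample line from the question.
html = '<a href="/cms/medical/plans_overview.html">Medical</a></div><a href="/cms/docs/mydoc.pdf">'

match = pattern.search(html)

# Group 1 begins inside the first href and ends inside the second one,
# because "." is allowed to match the double quote and the angle brackets.
overreaching_path = match.group(1)
print(overreaching_path)
```

Non-greedy only means "stop at the earliest point where the rest of the pattern can succeed"; it does not move the starting point of the match forward.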
Would appreciate any help ...
Upvotes: 0
Views: 448
Reputation: 655269
Since your attribute value is wrapped in double quotes, you can exclude double quotes from the characters being matched:
/href="\/cms\/([^"]*?\.(pdf|ppt|xls|doc))(\?[^"]*?)?"/i
You can narrow the valid characters even further by using [^<"].
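As a quick check of the negated character class, here is a minimal sketch in Python. The language choice and the names `DOC_LINK`, `html`, and `paths` are assumptions for illustration; the thread does not say which regex engine is in use. Because `[^"]` cannot cross the closing quote, the attempt that starts at the first href fails and the engine moves on to the second one.

```python
import re

# Same pattern as above in Python syntax; [^"] keeps each match inside one attribute value.
DOC_LINK = re.compile(r'href="/cms/([^"]*?\.(pdf|ppt|xls|doc))(\?[^"]*?)?"', re.IGNORECASE)

# Sample line from the question.
html = '<a href="/cms/medical/plans_overview.html">Medical</a></div><a href="/cms/docs/mydoc.pdf">'

# Collect the path captured by group 1 for every document link found.
paths = [m.group(1) for m in DOC_LINK.finditer(html)]
print(paths)
```

The .html link is skipped entirely, and only the path inside the second href is captured.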
Upvotes: 1