Reputation: 780
Assume I have a valid HTML file that I have saved into a string. Now I want to extract the links of the anchor elements (the href values), and I want to use only regular expressions for this.
preg_match_all('/<a [^>]*href="(.+)">/', $html, $match);
I expect to receive a string like this:
http://www.thisIsAHrefLinkIWantToHave.de
But instead I also receive the following string, which is logically caused by the (.+) in the regex:
index?a=f">Link</a> <a href="index?a=ds">Link 2</a> <a href="index?b=b">Link 3</a> <a href="index?gf=d">Link 4</a> <a href="index?ttt=q">Link 5</a> <a href="index?g=my">Link 6</a> <a href="http://mysite.org
I found solutions using XPath or DOMDocument (PHP String Manipulation: Extract hrefs), but I'd like a solution without those or any other libraries, just regex. What do I have to change in my regex to fix this?
I thought about matching from the first " to the next ". But how do I write that pattern, or another pattern that solves the problem?
[EDIT:] Solution
preg_match_all('/<a [^>]*href="([A-Za-z0-9\/?=:&_.]+)?"/', $html, $match);
Upvotes: 0
Views: 107
Reputation:
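Musa is correct that .+ is greedy: the + quantifier makes the period (.) match as much as it can. Try a character class instead of .+, for example [A-Za-z0-9_]+, extended with any other characters your URLs contain (such as / ? = : & .), or simply [^"]+ to match everything up to the next double quote.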
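A minimal sketch of the "from the first quote to the next quote" idea, using a negated character class. The sample markup in $html is made up for illustration, and the \s after <a is my own small variation on the original pattern.

```php
<?php
// Sample markup, assumed for illustration only.
$html = '<a href="index?a=f">Link</a> <a href="http://mysite.org">Site</a>';

// [^"]+ matches one or more characters that are not a double quote,
// so the capture group cannot run past the closing quote of the href value.
preg_match_all('/<a\s[^>]*href="([^"]+)"/', $html, $match);

// $match[1] holds the captured href values, one entry per anchor.
print_r($match[1]);
```

Because the pattern ends at the closing quote rather than at ">, it should also cope with anchors that carry further attributes after href.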
Upvotes: 0
Reputation: 97672
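Try preg_match_all('/<a [^>]*href="(.+?)">/', $html, $match); the ? placed directly after the + makes .+ non-greedy. (Writing (.+)? instead only makes the whole group optional; the .+ inside it stays greedy.)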
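To see the difference, here is a small comparison of the greedy and the non-greedy quantifier on the same input. The two-link $html string and the variable names $greedy and $lazy are assumptions chosen for the example.

```php
<?php
// Sample markup, assumed for illustration only.
$html = '<a href="index?a=f">Link</a> <a href="http://mysite.org">Site</a>';

// Greedy: .+ consumes as much as possible, so the single capture
// runs from the first href value up to the last "> on the line.
preg_match_all('/<a [^>]*href="(.+)">/', $html, $greedy);

// Lazy: .+? consumes as little as possible, so each capture
// ends at the first "> that follows it.
preg_match_all('/<a [^>]*href="(.+?)">/', $html, $lazy);

print_r($greedy[1]); // one long string spanning both anchors
print_r($lazy[1]);   // one entry per anchor
```

Note that both patterns require "> right after the value, so they only behave as intended when href is the last attribute of the tag.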
Upvotes: 1