user2853437

Reputation: 780

How to extract hrefs from HTML with PHP

Assume I have a valid HTML file that I have read into a string. Now I want to extract the links (the href values) of the anchor elements, and I want to do it with regular expressions only.

preg_match_all('/<a [^>]*href="(.+)">/', $html, $match);

I expect to get a string like this:

http://www.thisIsAHrefLinkIWantToHave.de

But instead I also get the following string, which is logically caused by the (.+) in the regex:

index?a=f">Link</a> &nbsp; <a href="index?a=ds">Link 2</a> &nbsp; <a href="index?b=b">Link 3</a> &nbsp; <a href="index?gf=d">Link 4</a> &nbsp; <a href="index?ttt=q">Link 5</a> &nbsp; <a href="index?g=my">Link 6</a> &nbsp; <a href="http://mysite.org

I found solutions that use XPath or DOMDocument (see "PHP String Manipulation: Extract hrefs"), but I'd like a solution without those or any other libraries, just regex. What do I have to change in my regex to fix this?

I thought about matching from the first " to the next ". How do I write that pattern, or another pattern that solves the problem?

[EDIT:] Solution

preg_match_all('/<a [^>]*href="([A-Za-z0-9\/?=:&_.]+)?"/', $html, $match);

Upvotes: 0

Views: 107

Answers (2)

user2848613

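Musa is correct that .+ is greedy (strictly, it is the + quantifier that is greedy, not the period itself). Try a restricted character class such as [A-Za-z0-9_]+ instead of .+, and add any other characters your URLs contain, for example / ? = : & and the period.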
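
A related option, offered here as a sketch rather than as part of the suggestion above, is the negated character class [^"]+, which matches everything up to the next double quote. That is the "from the first quote to the next quote" idea from the question. The sample $html string and the $hrefs variable are made up for illustration, and the pattern assumes that href values are always wrapped in double quotes.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = '<a href="index?a=f">Link</a> &nbsp; <a class="ext" href="http://mysite.org">Site</a>';

// [^"]+ consumes characters up to, but not including, the next double quote,
// so the capture cannot run past the end of the attribute value.
preg_match_all('/<a\s[^>]*href="([^"]+)"/', $html, $match);

// $match[1] holds the contents of the first capture group for every match.
$hrefs = $match[1];
print_r($hrefs);
```

Because the class excludes only the quote character, it does not need to list every character that may appear in a URL.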

Upvotes: 0

Musa

Reputation: 97672

Try preg_match_all('/<a [^>]*href="(.+?)">/', $html, $match);, the ? directly after the + makes .+ non-greedy. (Placed after the group, as in (.+)?, it would only make the group optional.)
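
To see the difference between a greedy and a non-greedy quantifier, here is a minimal sketch that runs both variants on the same input. The sample $html string and the names $greedy and $lazy are assumptions made for illustration. Note that the ? sits directly after the + inside the group.

```php
<?php
// Hypothetical sample input with two links on one line.
$html = '<a href="index?a=f">Link</a> &nbsp; <a href="http://mysite.org">Link 2</a>';

// Greedy: .+ runs to the end of the line, then backtracks to the LAST "> it can find.
preg_match_all('/<a [^>]*href="(.+)">/', $html, $greedy);

// Lazy: .+? grows one character at a time and stops at the FIRST "> it reaches.
preg_match_all('/<a [^>]*href="(.+?)">/', $html, $lazy);

print_r($greedy[1]); // a single capture spanning both anchors
print_r($lazy[1]);   // one capture per anchor
```

The lazy version still requires "> directly after the value, so an anchor that carries further attributes after href would capture more than the URL.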

Upvotes: 1
