Reputation: 89

Matching a Specific URL Pattern with PHP

I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:

https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web

I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.

Here's what I have now:

preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)

When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.

While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?

Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?

Thanks.

Upvotes: 0

Answers (4)

rich remer

Reputation: 3577

Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use something else for the sake of readability.

Here's a better answer (you may need to tweak if your anchors may have additional attributes other than href):

$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9_-]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("@$anchorPattern@i", $file, $matches);

This will give you something like:

[
    0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
    "href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
    "inner" => ["<span>More foo</span>"]
]

And absolutely, you should use the DOM for this.
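For comparison, here is a minimal sketch of the DOM and XPath route. The sample markup and the `$links` and `$prefix` names are made up for illustration; in the question, `$file` would hold the HTML you read from disk. It selects anchors whose href starts with the Google Docs prefix and collects each href with its inner HTML.

```php
<?php
// Hypothetical sample input; in the question this would be the contents of $file.
$file = <<<'HTML'
<p>
  <a class="doc" href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><span>First</span></a>
  <a href="https://example.com/">Other</a>
  <a href="https://docs.google.com/file/d/XyZ_789/edit?usp=drive_web">Second</a>
</p>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($file);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$prefix = 'https://docs.google.com/file/d/';
// Select every <a> whose href attribute begins with the prefix.
$nodes  = $xpath->query("//a[starts-with(@href, '$prefix')]");

$links = [];
foreach ($nodes as $a) {
    // Rebuild the inner HTML by serializing each child node in turn.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}

print_r($links);
```

Because the parser handles attributes for you, extra attributes such as class or target don't need any special treatment here. If you also need to check the /edit?usp=drive_web suffix, one option is to test each href in the loop.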

Upvotes: 1

Rottingham

Reputation: 2604

Dave,

The DOM would be better. But here is the Regex that works.

$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';

preg_match('/href="https:\/\/docs\.google\.com\/file\/d\/(.*?)"/', $url, $matches);

Results:

array (size=2)
    0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
    1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)

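You can add the HTML tags back in, but most importantly, the pattern in your question needed (.*?) instead of (.*). On its own, .* is greedy: it matches as many characters as it can, so the match runs all the way to the last </a> in the file. The added ? makes it lazy, so it stops at the first point where the rest of the pattern can match. Your pattern also didn't include the closing > of the opening tag, so the capture group swallowed the rest of the tag together with the inner HTML.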

Upvotes: 0

Steven

Reputation: 6148

You could use the following regular expression:

/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/

Which would give you the URL from the href and the innerHTML.

Breakdown

<a.*?href=" Matches the opening a tag and any characters up until href="

(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) until the end of the href (i.e. until the closing ")

.*?> Matches all characters to the end of the a tag >

(.*?)<\/a> Matches (and captures) the innerHTML until the closing a tag (i.e. </a>).
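To get the multidimensional array the question asks for, this pattern can be passed to preg_match_all with PREG_SET_ORDER. The following is a rough sketch; the sample input and the `$links` array are assumptions for illustration only.

```php
<?php
// Hypothetical sample input standing in for the question's $file.
$file = '<a class="x" href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><b>One</b></a>'
      . '<a href="https://docs.google.com/file/d/def456/edit?usp=drive_web" target="_blank">Two</a>';

$pattern = '/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/';

// PREG_SET_ORDER groups the results per match rather than per capture group.
preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[1] is the captured URL, $m[2] is the captured inner HTML.
    $links[] = ['href' => $m[1], 'inner' => $m[2]];
}

print_r($links);
```

Note that . does not match newlines by default, so an anchor that spans several lines would need the s modifier added after the closing delimiter.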

Upvotes: 0

Tharok

Reputation: 1041

Replace (.*) with (.*?) - use lazy quantification:

preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
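A small sketch of the difference, using made-up input with two matching anchors. The `~` delimiter is only a readability choice here so the slashes don't need escaping, and the variable names are hypothetical.

```php
<?php
// Hypothetical input: two matching anchors in one string.
$file = '<a href="https://docs.google.com/file/d/one/edit">A</a> and '
      . '<a href="https://docs.google.com/file/d/two/edit">B</a>';

$greedy = '~<a href="https://docs\.google\.com/file/d/(.*)</a>~';
$lazy   = '~<a href="https://docs\.google\.com/file/d/(.*?)</a>~';

preg_match($greedy, $file, $g);
preg_match($lazy, $file, $l);

echo $g[0], "\n"; // greedy: runs from the first <a to the last </a>
echo $l[0], "\n"; // lazy: stops at the first </a>

// With the lazy pattern, preg_match_all finds each anchor separately.
$lazyCount = preg_match_all($lazy, $file, $all);
echo $lazyCount, "\n";
```

The greedy version explains the result described in the question: a single match that covers both anchors and everything in between.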

Upvotes: 0
