jimc2013

Reputation: 3

Regular expression to convert mailto links

I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto tags in PHP:

preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|@]*)(@?)([^<]*)<\/a>/i",$content,$matches);

I pass it $content = '<a href="mailto:name@domain.com">somename@domain.com</a>'

It returns these matched pieces:

0 <a href="mailto:name@domain.com">somename@domain.com</a>
1  
2 "
3 name@domain.com
4 name
5 domain.com
6 "
7 
8 somename
9 @
10 domain.com

Example usage: <a href="send.php?user=$matches[4][0]&dom=$matches[5][0]">ucwords($matches[8][0])</a>

My problem is that some links contain nested tags. The expression stops at the first "<" when capturing pieces 8, 9 and 10, so nested tags throw it off.

Example: <a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>

I need to ignore the nested tags and just extract the "some name" piece:

match part 8 = <span><b>
match part 9 = somename
match part 10 = @
match part 11 = domain.com
match part 12 = </b></span>

I've tried to get it to work by tweaking ([^<|@]*)(@?)([^<]*) but I can't figure out the right syntax to match or ignore the nested tags.

Upvotes: 0

Views: 1456

Answers (4)

Dropzilla

Reputation: 512

Try this regex

/^(<.*>)(.*)(@)/

/^/ - Start of string

/(<.*>)/ - First match group: a < followed by anything up to a closing > (.* is greedy, so it can span several tags)

/(.*)(@)/ - Second and third groups: anything up to the last @ sign, then the @ itself

Upvotes: 0

Casimir et Hippolyte

Reputation: 89564

You can try this pattern:

$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';

and you can access your data like this:

echo $matches[0]['name'];
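
As a rough sketch of how the named groups could feed the link format from the question (the sample input, the `$out` array and the URL escaping are assumptions added for illustration, not part of the original answer): the `content` group holds the whole visible text, so the part before the "@" is split off separately.

```php
<?php
// Sample input with nested tags (placeholder addresses).
$html = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$out = [];
foreach ($matches as $m) {
    // 'content' is the full link text; keep only the part before the "@".
    $label = explode('@', $m['content'], 2)[0];
    $out[] = sprintf(
        '<a href="send.php?user=%s&amp;dom=%s">%s</a>',
        urlencode($m['name']),
        urlencode($m['domain']),
        htmlspecialchars(ucwords($label))
    );
}

echo implode("\n", $out), "\n";
```

With PREG_SET_ORDER each `$m` is one complete match, so the named groups can be read directly inside the loop.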

Upvotes: 0

Jonathan Kuhn

Reputation: 15311

You could just replace the part of the pattern that matches the link text with a .*?. That is, replace ([^<|@]*)(@?)([^<]*) with (.*?) and it will capture everything inside the <a> tag, nested tags included. You can remove the nested tags afterwards with strip_tags() or another regex.
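
A minimal sketch of that idea, assuming a simplified pattern rather than the full expression from the question (the pattern, the sample input and the `$links` array are illustrative only):

```php
<?php
// Sample input with nested tags (placeholder addresses).
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

// Simplified, hypothetical pattern: capture the two halves of the href
// address, then everything between <a ...> and </a> with a lazy (.*?).
$pattern = '~<a\s+href=(["\'])mailto:([^"\'@]+)@([^"\']+)\1[^>]*>(.*?)</a>~is';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // strip_tags() removes the nested <span>/<b> markup, leaving plain text.
    $text = trim(strip_tags($m[4]));
    // Split the visible text at the first "@", if there is one.
    $parts = explode('@', $text, 2);
    $links[] = [
        'user'  => $m[2],
        'dom'   => $m[3],
        'label' => $parts[0],
    ];
}

print_r($links);
```

The lazy quantifier matters here: a greedy (.*) would run on to the last </a> in the input and swallow several links at once.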

However, regular expressions are not well suited to nested HTML tags. You are better off using something like DOMDocument, which is built for parsing HTML. Something like:

<?php
$DOM = new DOMDocument();
$DOM->loadXML('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');

$list = $DOM->getElementsByTagName('a');

foreach($list as $link){
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    //only match if href starts with mailto:
    if(stripos($href, 'mailto:') === 0){
        var_dump($href);
        var_dump($text);
    }
}

http://codepad.viper-7.com/SqDKgr
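
Going one step further, the pieces the question asks for could be pulled out of the href and the link text roughly like this. This is a sketch, assuming PHP's DOM extension is available; it uses loadHTML() instead of loadXML() so that input that is not well-formed XML still parses, and the `$result` array and sample markup are made up for illustration.

```php
<?php
// Sample input with nested tags (placeholder addresses).
$html = '<p><a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a></p>';

$doc = new DOMDocument();
// loadHTML() accepts an HTML fragment and wraps it in html/body itself.
$doc->loadHTML($html);

$result = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if (stripos($href, 'mailto:') !== 0) {
        continue; // not a mailto link
    }
    // Drop the "mailto:" prefix, then split the address at the "@".
    $pieces = explode('@', substr($href, strlen('mailto:')), 2);
    if (count($pieces) !== 2) {
        continue; // no "@" in the address
    }
    // textContent is the link text with all nested tags already ignored.
    $label = explode('@', trim($link->textContent), 2)[0];
    $result[] = ['user' => $pieces[0], 'dom' => $pieces[1], 'label' => $label];
}

print_r($result);
```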

Upvotes: 1

Campfire

Reputation: 894

To only get access to the part within the link, try

[^>]*>([^>]+)@.*

What you need should be in the first group of the result.

Upvotes: 0
