Reputation: 3
I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto tags in PHP:
preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|@]*)(@?)([^<]*)<\/a>/i",$content,$matches);
I pass it $content = '<a href="mailto:name@domain.com">somename@domain.com</a>'
It returns these matched pieces:
0 <a href="mailto:name@domain.com">somename@domain.com</a>
1
2 "
3 name@domain.com
4 name
5 domain.com
6 "
7
8 somename
9 @
10 domain.com
Example usage: <a href="send.php?user=$matches[4][0]&dom=$matches[5][0]">ucwords($matches[8][0])</a>
My problem is that some links contain nested tags. The expression looks for "<" to delimit pieces 8, 9 and 10, so nested tags throw it off.
Example:
<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>
I need to ignore the nested tags and extract just the "somename" piece:
match part 8 = <span><b>
match part 9 = somename
match part 10 = @
match part 11 = domain.com
match part 12 = </b></span>
I've tried to get it to work by tweaking ([^<|@]*)(@?)([^<]*)
but I can't figure out the right syntax to match or ignore the nested tags.
Upvotes: 0
Views: 1456
Reputation: 512
Try this regex
/^(<.*>)(.*)(@)/
/^/
- Start of string
/(<.*>)/
- First match group, starts with < then anything in between until it hits >
/(.*)(@)/
- Second group matches anything up to the @ sign, which the third group captures
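As a quick check of how those groups behave, here is a minimal sketch that runs the pattern on the nested example from the question. The sample address (name@domain.com with the link text somename@domain.com) is assumed from the question's match list, and because the pattern is anchored with ^ the sketch assumes the string contains only that one link.

```php
<?php
// Sample input assumed from the question; it has to start with the <a> tag because of ^.
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

$label = null;
if (preg_match('/^(<.*>)(.*)(@)/', $content, $m)) {
    // $m[1] holds the leading tags, $m[2] the text before the @, $m[3] the @ itself.
    $label = $m[2];
    echo $label, "\n";
}
```

The greedy (<.*>) backs off to the last ">" that still leaves an @ to its right, which is why the nested opening tags end up in the first group rather than in the label.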
Upvotes: 0
Reputation: 89564
You can try this pattern:
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';
and you can access your data like this:
echo $matches[0]['name'];
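To tie the named groups back to the question's send.php link, here is a rough sketch. The sample $html and the send.php?user=...&dom=... target are taken from the question; splitting the visible text on "@" to get the label is my own assumption about what the asker wants.

```php
<?php
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
$html = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $match) {
    // Keep only the part of the visible text before the @ as the label.
    $label = explode('@', $match['content'], 2)[0];
    // send.php and its user/dom parameters come from the question's example usage.
    $query = http_build_query(['user' => $match['name'], 'dom' => $match['domain']], '', '&');
    $links[] = '<a href="send.php?' . htmlspecialchars($query) . '">'
        . htmlspecialchars(ucwords($label)) . '</a>';
}
echo implode("\n", $links), "\n";
```

With PREG_SET_ORDER each element of $matches is one link, so the loop handles several mailto links in the same document without index juggling.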
Upvotes: 0
Reputation: 15311
You could just match everything between the <a> tags with .*? instead. Replace ([^<|@]*)(@?)([^<]*) with (.*?) and it will capture everything inside the <a> tag, including nested tags. You can remove the nested tags afterwards with strip_tags() or another regex.
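A minimal sketch of that swap, using the pattern from the question with only the last three groups replaced. The sample $content is assumed from the question, and the s modifier is my addition so that the dot also matches newlines inside a link.

```php
<?php
// The question's pattern with ([^<|@]*)(@?)([^<]*) swapped for (.*?).
// The s modifier is an addition (not in the original) so that . also matches newlines.
$pattern = '/<a([ ]+)href=(["\']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))(["\']*)([[:space:][:alnum:]="_]*)>(.*?)<\/a>/is';
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

preg_match_all($pattern, $content, $matches);

$labels = [];
foreach ($matches[8] as $inner) {
    // Group 8 now holds the raw inner HTML; strip the nested tags first.
    $text = strip_tags($inner);
    // Then keep only the part before the @.
    $labels[] = explode('@', $text, 2)[0];
}
print_r($labels);
```

Groups 4 and 5 still carry the user and domain from the href, so the rest of the original code can stay as it is.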
However, regular expressions are not well suited to nested HTML tags. You are better off using something like DOMDocument, which is made exactly for parsing HTML. Something like:
<?php
$DOM = new DOMDocument();
// loadHTML() tolerates real-world markup; loadXML() only accepts well-formed XML
$DOM->loadHTML('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');
$list = $DOM->getElementsByTagName('a');
foreach ($list as $link) {
    $href = $link->getAttribute('href');
    // nodeValue is the link's text with nested tags ignored
    $text = $link->nodeValue;
    // only match if href starts with mailto:
    if (stripos($href, 'mailto:') === 0) {
        var_dump($href);
        var_dump($text);
    }
}
http://codepad.viper-7.com/SqDKgr
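Building on the DOMDocument approach, the sketch below pulls out the same pieces the question's regex produced (user, domain, and the text before the @). The helper name extractMailtoLinks and its return shape are hypothetical, chosen only for illustration, and the sample input is assumed from the question.

```php
<?php
// Hypothetical helper (not from the answer): collects user, domain and label per mailto link.
function extractMailtoLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (stripos($href, 'mailto:') !== 0) {
            continue; // not a mailto link
        }
        // Drop the "mailto:" prefix and any ?subject=... query part.
        $address = explode('?', substr($href, 7), 2)[0];
        $parts = explode('@', $address, 2);
        if (count($parts) !== 2) {
            continue; // no @ in the address, skip it
        }
        [$user, $domain] = $parts;
        // textContent ignores nested tags such as <span> and <b>.
        $label = explode('@', trim($link->textContent), 2)[0];
        $result[] = ['user' => $user, 'domain' => $domain, 'label' => $label];
    }
    return $result;
}

$found = extractMailtoLinks('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');
print_r($found);
```

Each entry can then feed the send.php link from the question, with no dependence on how deeply the link text is nested.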
Upvotes: 1
Reputation: 894
To only get access to the part within the link, try
[^>]*>([^>]+)@.*
What you need should be in the first group of the result.
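For illustration, a small sketch that applies this pattern to both the plain and the nested link from the question (sample strings assumed from the question's match list). Note that the pattern itself does not check for mailto:, so it fits best when the input is already known to be a single mailto link.

```php
<?php
$pattern = '/[^>]*>([^>]+)@.*/';

$samples = [
    '<a href="mailto:name@domain.com">somename@domain.com</a>',
    '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>',
];

$labels = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // Group 1 is the text between the last opening tag and the @.
        $labels[] = $m[1];
    }
}
print_r($labels);
```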
Upvotes: 0