Reputation: 4398
I have the following string:
$string = '<meta name="Keywords" lang="fr" content="ecole commerce, apres bac, ecole management, ecole de management, écoles de commerce, école de management, classement ecole de commerce, ecole commerce paris, ecole superieure de commerce, concours ecole commerce, hec, esc, prepa, forum ecole commerce, avis ecole commerce" /><meta name="description" content="Tout pour s\'informer et échanger sur les écoles de commerce et de management, les concours, les classements, la prépa... Des témoignages et un forum pour faire le meilleur choix" /><meta name="robots" content="all" />';
and I am trying to extract only the content of the "description" meta tag with this regular expression:
echo preg_replace('/(?:.*)name\="description" content\="(.*)"(?:.*)/i', '$1', $string);
but what I get is:
Tout pour s'informer et échanger sur les écoles de commerce et de management, les concours, les classements, la prépa... Des témoignages et un forum pour faire le meilleur choix" /><meta name="robots" content="all
So why do I get the extra " /><meta name="robots" content="all at the end?
Upvotes: 1
Views: 238
Reputation: 47874
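Using regex to parse HTML is not advised; a DOM parser is more robust. To preserve multibyte characters, make sure the parser treats the input as UTF-8. There are a few ways to do this; the code below converts non-ASCII characters to numeric entities before loading.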
XPath is a particularly elegant tool for isolating the target element and returning the desired attribute value.
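Code:
// $html is the markup to parse (the $string from the question).
$doc = new DOMDocument();
$doc->loadHTML(
    // Convert non-ASCII characters to numeric entities so multibyte text survives loadHTML().
    mb_encode_numericentity(
        $html,
        [0x80, 0x10FFFF, 0, ~0],
        'UTF-8'
    )
);
$xpath = new DOMXPath($doc);
// string(...) returns the attribute value of the first matching node.
echo $xpath->evaluate('string(//meta[@name="description"]/@content)');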
Output:
Tout pour s'informer et échanger sur les écoles de commerce et de management, les concours, les classements, la prépa... Des témoignages et un forum pour faire le meilleur choix
Upvotes: 0
Reputation: 5805
Don't use a greedy quantifier for the capture group; make it lazy with (.*?) and it will work:
<?php echo preg_replace('/(?:.*)name\="description" content\="(.*?)"(?:.*)/i', '$1', $string); ?>
Upvotes: 1
Reputation: 35917
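You can also add the U (ungreedy) modifier to your regexp. Without it, (.*) is greedy and extends to the last " in your string, which is why you get part of the next tag. Because U makes every quantifier lazy, including the trailing (?:.*), anchor the pattern with $ so that the rest of the string is still consumed:
preg_replace('/(?:.*)name\="description" content\="(.*)"(?:.*)$/iU', '$1', $string);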
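Note that you could also replace it with something like this:
preg_replace('/(?:.*)name\="description" content\="([^"]*)"(?:.*)/i', '$1', $string);
[^"] means "any character that is not a double quote", so the capture cannot run past the closing quote. Keep the trailing (?:.*) when using preg_replace, though: without it, the text after the closing quote would be left in the result.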
I also prefer preg_match with a third argument when the goal is to extract something rather than replace it. I would do it like this:
$var = array();
preg_match('/name\="description" content\="([^"]*)"/iU', $string, $var);
$var[1] contains your string if the regexp found a match.
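As a minimal sketch of the preg_match approach above, the return value can be checked before reading $var[1]. The shortened $string and the $description variable are assumptions made for illustration; they are not from the original post.

```php
<?php
// Shortened stand-in for the $string from the question (assumed: same attribute order).
$string = '<meta name="Keywords" content="hec, esc, prepa" />'
        . '<meta name="description" content="Tout pour s\'informer sur les écoles" />'
        . '<meta name="robots" content="all" />';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('/name="description" content="([^"]*)"/i', $string, $var) === 1) {
    $description = $var[1];   // text captured by the first group
} else {
    $description = null;      // no description meta tag found
}

echo $description, PHP_EOL;
```

Checking for === 1 avoids reading an undefined $var[1] when the tag is missing.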
Upvotes: 2
Reputation: 17186
/(?:.*)name\="description" content\="-->(.*)<-- this is what matches the extra stuff that you don't want and did not expect to match.
/(?:.*)name\="description" content\="(.*)-->"<-- this is what matches the quote after the word 'all'.
You want the regex to stop matching sooner rather than later, hence the need to put it into an ungreedy mode of operation (as other posters have said).
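The difference between the two modes can be seen on a much shorter input. This is only an illustrative sketch; the $input string and the variable names are hypothetical and not taken from the question.

```php
<?php
// Hypothetical short input containing two quoted values.
$input = 'content="first" /><meta name="robots" content="all"';

// Greedy: .* runs to the end of the line, then backtracks to the LAST double quote.
preg_match('/content="(.*)"/', $input, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST double quote.
preg_match('/content="(.*?)"/', $input, $lazy);

echo $greedy[1], PHP_EOL; // first" /><meta name="robots" content="all
echo $lazy[1], PHP_EOL;   // first
```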
Upvotes: 0
Reputation: 145482
An idiom I use to avoid greedy regexes is a negated character class that excludes the enclosing delimiter (that is, [^"] if the value is supposed to be enclosed in double quotes). It is more reliable in edge cases:
/content="([^"]*)"/i
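A short sketch of this idiom follows, assuming a shortened version of the question's markup (the $string, $all and $desc names are illustrative). On its own the pattern matches every content attribute, so selecting the description still requires the name attribute in the pattern.

```php
<?php
// Shortened stand-in for the markup in the question (assumed structure).
$string = '<meta name="Keywords" content="hec, esc" />'
        . '<meta name="description" content="Tout pour s\'informer" />'
        . '<meta name="robots" content="all" />';

// The negated class keeps each capture inside its own pair of quotes.
preg_match_all('/content="([^"]*)"/i', $string, $all);
print_r($all[1]); // three values: keywords, description, robots

// Prefixing the name attribute selects only the description.
preg_match('/name="description"\s+content="([^"]*)"/i', $string, $desc);
echo $desc[1], PHP_EOL; // Tout pour s'informer
```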
Upvotes: 1