damusnet

Reputation: 4398

Get the content attribute value of the <meta name="description"> tag in an HTML string

I have the following string:

$string = '<meta name="Keywords" lang="fr" content="ecole commerce, apres bac, ecole management, ecole de management, écoles de commerce, école de management, classement ecole de commerce, ecole commerce paris, ecole superieure de commerce, concours ecole commerce, hec, esc, prepa, forum ecole commerce, avis ecole commerce" /><meta name="description" content="Tout pour s\'informer et échanger sur les écoles de commerce et de management, les concours, les classements, la prépa... Des  témoignages et un forum pour faire le meilleur choix" /><meta name="robots" content="all" />';

and I am trying to extract only the "description" meta content from it with this regular expression:

echo preg_replace('/(?:.*)name\="description" content\="(.*)"(?:.*)/i', '$1', $string);

but what I get is:

Tout pour s'informer et échanger sur les écoles de commerce et de management, les concours, les classements, la prépa... Des témoignages et un forum pour faire le meilleur choix" /><meta name="robots" content="all

So, why the extra " /><meta name="robots" content="all?!

Upvotes: 1

Views: 238

Answers (5)

mickmackusa

Reputation: 47874

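Using regex to parse HTML is not advised; a DOM parser is more robust. To preserve multibyte characters, the parser must read the document as UTF-8. There are a few ways to do this, and the snippet below converts every non-ASCII character to a numeric entity before loading.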

XPath is a particularly elegant tool for isolating the target element and returning the desired attribute value.

Code:

$doc = new DOMDocument();
$doc->loadHTML(
    mb_encode_numericentity(
        $string,
        [0x80, 0x10FFFF, 0, ~0],
        'UTF-8'
    )
);
$xpath = new DOMXPath($doc);
echo $xpath->evaluate('string(//meta[@name="description"]/@content)');

Output:

Tout pour s'informer et échanger sur les écoles de commerce
 et de management, les concours, les classements, la prépa... Des
 témoignages et un forum pour faire le meilleur choix
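Another commonly used way to declare the charset is to prepend a charset meta tag before loading. The following is only a minimal sketch under that assumption; the shortened `$html` input is hypothetical and stands in for the asker's string.

```php
<?php
// Hypothetical shortened input containing multibyte characters.
$html = '<meta name="description" content="écoles de commerce" /><meta name="robots" content="all" />';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$doc = new DOMDocument();
// Prepending a charset declaration asks libxml to read the bytes as UTF-8.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

$xpath = new DOMXPath($doc);
// string(...) returns an empty string when no matching attribute exists.
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');

echo $description, PHP_EOL;
```

Either approach leaves the XPath query unchanged; only the way the encoding is communicated to the parser differs.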

Upvotes: 0

valodzka

Reputation: 5805

Don't use a greedy quantifier for this. A lazy one (.*?) will work:

<?php echo preg_replace('/(?:.*)name\="description" content\="(.*?)"(?:.*)/i', '$1', $string); ?>
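To see the difference side by side, here is a small sketch on a shortened, hypothetical input (`$html` is not the asker's full string):

```php
<?php
// Shortened, hypothetical input with another tag after the one we want.
$html = '<meta name="description" content="Hello world" /><meta name="robots" content="all" />';

// Greedy: (.*) runs to the end of the string, then backtracks only
// as far as the LAST double quote.
$greedy = preg_replace('/.*name="description" content="(.*)".*/i', '$1', $html);

// Lazy: (.*?) grows one character at a time and stops at the FIRST
// double quote that lets the rest of the pattern match.
$lazy = preg_replace('/.*name="description" content="(.*?)".*/i', '$1', $html);

echo $greedy, PHP_EOL; // Hello world" /><meta name="robots" content="all
echo $lazy, PHP_EOL;   // Hello world
```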

Upvotes: 1

Vincent Savard

Reputation: 35917

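You should also add the U (ungreedy) modifier to your regexp. Without it, the greedy (.*) extends to the last " in your string, which is why you get part of the following tag. Because U also makes the trailing (?:.*) lazy, anchor the pattern with $ so that the rest of the string is still consumed and discarded:

preg_replace('/(?:.*)name\="description" content\="(.*)"(?:.*)$/iU', '$1', $string);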

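Note that you could also replace it with something like this:

preg_replace('/(?:.*)name\="description" content\="([^"]*)"(?:.*)/i', '$1', $string);

[^"] means "any character that is not a double quote". The leading and trailing (?:.*) only matter for preg_replace, where they consume the rest of the string so it is discarded; for pure matching they are unnecessary.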

I also prefer preg_match with a third argument when you want to extract something rather than replace it. I would do what you want like this:

$var = array();
preg_match('/name\="description" content\="([^"]*)"/iU', $string, $var);

$var[1] contains your string if the regexp found a match.
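As a hedged sketch of that check, the return value of preg_match can guard the access to the captured group. The shortened `$html` input and the `$description` variable are hypothetical names used only for illustration.

```php
<?php
// Hypothetical shortened input; $html stands in for the asker's $string.
$html = '<meta name="Keywords" content="a, b" /><meta name="description" content="Hello world" /><meta name="robots" content="all" />';

$matches = [];
// preg_match returns 1 on a match, 0 on no match, and false on error.
if (preg_match('/name="description" content="([^"]*)"/i', $html, $matches) === 1) {
    $description = $matches[1]; // first capture group
} else {
    $description = null;        // no description meta found
}

echo $description ?? '(none)', PHP_EOL; // Hello world
```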

Upvotes: 2

Aaron Anodide

Reputation: 17186

/(?:.*)name\="description" content\="-->(.*)<-- this is what matches the extra stuff that you don't want/did not expect to match.

/(?:.*)name\="description" content\="(.*)-->"<-- this is what matches the quote after the word 'all'

You want the regex to stop matching sooner rather than later, hence the need to put it into an ungreedy mode of operation (as other posters have said).

Upvotes: 0

mario

Reputation: 145482

An idiom I use to avoid greedy regexes is a negated character class that excludes the enclosing delimiter (that is, [^"] if the value is enclosed in double quotes). It is more reliable for edge cases:

  /content="([^"]*)"/i
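On its own, that pattern matches the first content attribute it meets, which in the asker's string belongs to the Keywords tag. A possible way to narrow it, sketched on a hypothetical shortened input and assuming the name attribute comes before content within the tag:

```php
<?php
// Hypothetical shortened input, not the asker's full string.
$html = '<meta name="Keywords" content="a, b" /><meta name="description" content="Hello world" />';

// The bare pattern captures the first content attribute in the string.
preg_match('/content="([^"]*)"/i', $html, $first);

// Requiring name="description" earlier in the same tag narrows the match.
// [^>]* cannot cross a ">", so the match stays inside one tag.
preg_match('/<meta\s[^>]*name="description"[^>]*content="([^"]*)"/i', $html, $desc);

echo $first[1], PHP_EOL; // a, b
echo $desc[1], PHP_EOL;  // Hello world
```

The same negated-class idea is used twice here: [^"] bounds the attribute value and [^>] bounds the tag.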

Upvotes: 1
