Reputation: 2098

PHP regexp issue

I'm trying to get the sentence that contains the link in the following text:

<p> Referencement PG1 est spécialiste en référencement depuis 2004. Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, aidera de nous trouver. Fascinez le regard avec le film vidéo. Vous demeurerez persistant sur les plateformes Youtube, Dailymotion ... Les images Video apparaissant dans les index de Google appâteront les surfeurs. <img style="padding:5px;float:left" src="http://thumbs.virtual-tour.tv/referencementpage1.jpg"> Par le appel à la Vidéo, faites-vous connaître. </p>

That is, this sentence:

Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, aidera de nous trouver.

I'm using this regexp:

([A-Z][^<]*)<a[^>]*>([^<]*)</a>([^\.!\?]*)

I can't find out why it's not working: it gives me the previous sentence along with the one I need:

Referencement PG1 est spécialiste en référencement depuis 2004. Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, aidera de nous trouver.

What am I missing? Thanks for the help =D

EDIT (some code):

preg_match_all('#([A-Z][^<\.!\?]*)<a[^>]*>([^<]*)</a>(.*[^\.!\?]*)#U', $spinnedText, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
foreach ($matches[1] as $key => $value) {
    //$spinnedText = str_replace($matches[0][$key], "<a {title=\"".$this->url."\"|} {rev=\"{index|help|bookmark|friend}\"|} {dir=\"rtl\"|}{rel=\"{friend|bookmark|help|}\"|} href=\"".$this->url."\">".trim($value)."</a>", $spinnedText);
    $spinnedText = str_replace($matches[0][$key], "<a {title=\"".$this->url."\"|} {rev=\"{index|help|bookmark|friend}\"|} {dir=\"rtl\"|}{rel=\"{friend|bookmark|help|}\"|} href=\"".$this->url."\">".$matches[1][$key].$matches[2][$key].$matches[3][$key]."</a>", $spinnedText);
}

Upvotes: 2

Answers (3)

devsnd

Reputation: 7722

You might want to look into a DOM Parser instead:

For example: http://simplehtmldom.sourceforge.net/

Example from their site:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
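If you'd rather not pull in a third-party library, PHP's built-in DOMDocument can do something similar. What follows is only a minimal sketch, not a drop-in solution: the $html sample is the paragraph from the question with the closing quote after the URL assumed, and the "split after ., ! or ? followed by whitespace" rule is my own assumption about what a sentence is. It returns the sentence as plain text, without the tag.

```php
<?php
// Sample paragraph from the question (closing "> after the URL is assumed).
$html = '<p>Referencement PG1 est spécialiste en référencement depuis 2004. '
      . 'Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, '
      . 'aidera de nous trouver. Fascinez le regard avec le film vidéo.</p>';

libxml_use_internal_errors(true);            // keep libxml parse warnings quiet
$doc = new DOMDocument();
// The XML encoding prefix is a common trick to make loadHTML read the input as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$sentences = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    // Plain text of the element that contains the link, tags stripped.
    $text = $link->parentNode->textContent;
    // Assumed sentence rule: split after ., ! or ? when followed by whitespace.
    foreach (preg_split('/(?<=[.!?])\s+/u', $text) as $sentence) {
        // Keep the sentence that contains the link text.
        if (strpos($sentence, $link->textContent) !== false) {
            $sentences[] = trim($sentence);
        }
    }
}

print_r($sentences);
```

Note that this matches on the link text, so another sentence repeating the same words would be picked up as well.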

Upvotes: 0

Michi Werner

Reputation: 336

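This comes from how the engine matches: it takes the leftmost position where the pattern can succeed, and a greedy quantifier then consumes every character its class allows. Since [^<]* also accepts ., ! and ?, the match starts at the first capital letter and runs across the sentence boundary. You have to restrict the START of the regex so it can't span several sentences.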

Try this:

[^.!?]*<\s*a[^>]+>([^<]*)</a>[^.?!]*[.?!]

It should match the whole sentence and nothing more.
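Here is how the pattern above might be used with preg_match. Treat it as a sketch: $text is the sample from the question with the closing quote after the URL assumed, and the variable names are just for illustration.

```php
<?php
// Sample text from the question (closing "> after the URL is assumed).
$text = '<p> Referencement PG1 est spécialiste en référencement depuis 2004. '
      . 'Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, '
      . 'aidera de nous trouver. Fascinez le regard avec le film vidéo. </p>';

// ~ delimiters avoid escaping the slash in </a>; the u flag treats the subject as UTF-8.
$pattern = '~[^.!?]*<\s*a[^>]+>([^<]*)</a>[^.?!]*[.?!]~u';

$sentence = null;
$linkText = null;
if (preg_match($pattern, $text, $m)) {
    $sentence = trim($m[0]);   // whole sentence; the match starts with the space after "2004."
    $linkText = $m[1];         // text between <a ...> and </a>
    echo $sentence, PHP_EOL, $linkText, PHP_EOL;
}
```

Keep in mind that any ., ! or ? between the start of the sentence and the link (an abbreviation, for instance) would cut the match short.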

Hope this helps.

Upvotes: 0

Explosion Pills

Reputation: 191729

Your regular expression still matches the first sentence, since that sentence begins with a capital letter. You need to anchor the start of the match with \. or (?:^|[\.!?]) or something similar, but that may be a problem for you, since the first sentence may itself be a valid match in some circumstances. Can several sentences contain these links? The important question is what defines a sentence.

This works with your sample, and also handles a sentence right after a p> and a sentence at the start of the string:

preg_match('/
   (?:           # match, but do not capture, any of
   ^             # the start of the string
   |p>\s*        # or an opening or closing p tag followed by any whitespace
   |[.!?]\s+     # or sentence punctuation followed by whitespace
   )
   (             # capture
   [A-Z]         # a capital letter
   [^.!?<]*      # followed by text without sentence punctuation or tags
   <a[^>]*>      # an opening anchor tag
   .*?           # followed by any characters until
   <\/a>         # a closing anchor tag
   [^.!?]*       # followed by text up to
   [.?!])        # closing punctuation
/x', $item, $matches);
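Since so much depends on what defines a sentence, another option is to split the text into sentences first and then keep the ones that contain a link. A minimal sketch, under the assumption that a sentence ends with ., ! or ? followed by whitespace; $item is the sample from the question with the closing quote after the URL assumed, and stripping the p tags up front is just preprocessing for this sample.

```php
<?php
// Sample text from the question (closing "> after the URL is assumed).
$item = '<p> Referencement PG1 est spécialiste en référencement depuis 2004. '
      . 'Une recherche sur <a rev="help" dir="rtl" href="http://www.referencement-site-pro.com">Mot Clé</a>, '
      . 'aidera de nous trouver. Fascinez le regard avec le film vidéo. </p>';

// Drop the surrounding paragraph tags (preprocessing for this sample only).
$body = trim(preg_replace('~</?p>~', '', $item));

// Assumed sentence rule: split after ., ! or ? when followed by whitespace.
// The dots inside the URL are not followed by whitespace, so the link stays in one piece.
$sentences = preg_split('/(?<=[.!?])\s+/', $body);

// Keep only the sentences that contain an anchor tag.
$withLinks = array_values(array_filter($sentences, function ($s) {
    return strpos($s, '<a ') !== false;
}));

print_r($withLinks);
```

This handles several linked sentences in one pass, but it breaks if an attribute value or the link text itself contains punctuation followed by a space.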

Upvotes: 1
