James Zhao

Reputation: 731

Regular expression missed first occurrence of target string

I am using a regular expression to fetch both text1 and text2 from the HTML below. Here is the pattern: /<div\s?class="right-col">[\s\n\S]*<a[\s\n]?[^>]*>@(.*)<\/a>/ but it misses text1 and only captures text2 (here is the link to my problem).

<div class="right-col">
    <h1>
        <a href="url-link-here" title="title-here">title1</a>
    </h1>
    <p>some text here</p>
<div class="some-class">
    <div class="left">
        <span><a href="url-link-here" class="breaking" target="_blank">some text here </a></span>      
    </div>
    <div class="postmeta"><a href="url-link-here" >@text1</a> </div>
</div>
<div class="right-col">
    <h1>
        <a href="url-link-here" title="title-here">title2</a>
    </h1>
    <p>some text here</p>
<div class="some-class">
    <div class="left">
        <span><a href="url-link-here" class="breaking" target="_blank">some text here </a></span>      
    </div>
    <div class="postmeta"><a href="url-link-here" >@text2</a> </div>
</div>

Can you tell me what is wrong with my regular expression? Is there a better way to capture title1 and title2 as well as text1 and text2?

Upvotes: 0

Views: 101

Answers (2)

JimW

Reputation: 186

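This is a fairly common issue with regular expressions: quantifiers are greedy by default. [\s\S]* (the \n is redundant, since \s already matches newlines) first consumes everything up to the end of the input, then backtracks only as far as needed for the rest of the pattern to match. That lands on the last <a ...>@...</a>, which is why you only get text2. Adding a ? after the * makes the quantifier lazy, so it stops at the first link that fits, and using your link it returns both text1 and text2.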

The short answer is to replace [\s\n\S]* with [\s\S]*?, but as others have mentioned, this is probably not a good use of regular expressions.
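To make the difference concrete, here is a minimal sketch that runs both patterns with preg_match_all. The $html string is a trimmed-down stand-in for the markup in the question (an assumption for illustration, not the full page), and the variable names are made up for this example.

```php
<?php
// Trimmed-down stand-in for the question's markup (assumed for illustration).
$html = <<<HTML
<div class="right-col">
    <h1><a href="url-link-here" title="title-here">title1</a></h1>
    <div class="postmeta"><a href="url-link-here" >@text1</a> </div>
</div>
<div class="right-col">
    <h1><a href="url-link-here" title="title-here">title2</a></h1>
    <div class="postmeta"><a href="url-link-here" >@text2</a> </div>
</div>
HTML;

// Original pattern: the greedy [\s\S]* runs to the end and backtracks to the LAST "@" link.
$greedy = '/<div\s?class="right-col">[\s\S]*<a[\s\n]?[^>]*>@(.*)<\/a>/';
// Same pattern with a lazy quantifier: stops at the FIRST "@" link after each div.
$lazy = '/<div\s?class="right-col">[\s\S]*?<a[\s\n]?[^>]*>@(.*)<\/a>/';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

print_r($greedyMatches[1]); // one capture: text2
print_r($lazyMatches[1]);   // two captures: text1, text2
```

The greedy version swallows both blocks in a single match, so there is only one capture group to report; the lazy version produces one match per right-col block.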

Upvotes: 0

gen_Eric

Reputation: 227310

Using a regular expression here is not the best way to do it. It's bad practice. You should be using a DOM/XML parser to do this.

I like using PHP's DOMDocument class. Using XPath, we can quickly find the elements you want:

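// $html is assumed to hold the markup from the question.
libxml_use_internal_errors(true); // collect libxml parse warnings instead of printing them

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Every <a> inside div.some-class whose text starts with "@".
$aTags = $xPath->query('//div[@class="some-class"]//a[starts-with(text(), "@")]');

foreach ($aTags as $a) {
    echo $a->nodeValue, PHP_EOL; // one value per line
}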

DEMO: http://codepad.viper-7.com/QHOXzH
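Since the question also asks about title1 and title2, here is a rough sketch of how the same XPath approach might be extended to collect both sets of values. It assumes the titles always sit in an <a> directly inside the <h1> of a right-col div and the handles in a postmeta div, as in the sample; $html is a shortened stand-in for the real page, and $titles and $handles are names made up for this example.

```php
<?php
// Shortened stand-in for the question's markup (assumed for illustration).
$html = <<<HTML
<div class="right-col">
    <h1><a href="url-link-here" title="title-here">title1</a></h1>
    <div class="postmeta"><a href="url-link-here" >@text1</a> </div>
</div>
<div class="right-col">
    <h1><a href="url-link-here" title="title-here">title2</a></h1>
    <div class="postmeta"><a href="url-link-here" >@text2</a> </div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Titles: the link inside the <h1> of each right-col block.
$titles = [];
foreach ($xPath->query('//div[@class="right-col"]/h1/a') as $a) {
    $titles[] = trim($a->nodeValue);
}

// Handles: links in div.postmeta whose text starts with "@", with the "@" stripped.
$handles = [];
foreach ($xPath->query('//div[@class="postmeta"]/a[starts-with(text(), "@")]') as $a) {
    $handles[] = ltrim(trim($a->nodeValue), '@');
}

print_r($titles);
print_r($handles);
```

Both queries return nodes in document order, so $titles[0] pairs with $handles[0] as long as every block contains exactly one of each.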

Upvotes: 2
