Gunjack Sparrohw

Reputation: 13

Unable to get proper output from regex

Here's my regex:

Name:<\/h5>.*?<div class="info-name">(.*?)(<a|<\/div|\|)

Here's the content :

<h5>Name:</h5>
<div class="info-name">
Josh Taguibao
</div><a class="t0 profile" >Click to view Profile</a>

I am able to get my output, which is

Josh Taguibao

However, if the content changes to something like this:

<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>

I only get Josh instead of the whole name.

What should I add to my pattern?

Upvotes: 0

Views: 64

Answers (2)

mickmackusa

Reputation: 47894

If you don't want to use an HTML parser (which the SO community recommends at every opportunity), you can just match the text and strip the tags:

Code: (PHP Demo) (Pattern Demo)

$string='<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';

echo preg_match('~Name:</h5>.*?<div class="info-name">\s*\K.*?(?=\s*</div|\s*\|)~s',$string,$out)?strip_tags($out[0]):'fail';

Output:

Josh Taguibao

Notes:

  • ~ is used as the pattern delimiter so that the forward slashes (/) in the pattern don't need to be escaped.
  • \K in the pattern means: "start the fullstring match from here".
  • (?=...) is a positive lookahead, which is used to halt the fullstring match before optional whitespace followed by </div or | (normally I would write (?=\s*(?:</div|\|)), but the verbose way actually took fewer steps).
  • The s modifier/flag at the end of the pattern permits the . (dots) to also match newline characters.
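To make the effect of \K easier to see, here is a minimal sketch that runs the same pattern with and without it. The shortened input and the variable names ($withoutK, $withK) are made up for illustration; only the pattern itself comes from the snippet above.

```php
<?php
// Hypothetical, shortened input for illustration only.
$string = "<h5>Name:</h5>\n<div class=\"info-name\">\nJosh <a href=\"#tagclan\">Taguibao</a>\n</div>";

// Without \K: the fullstring match starts at "Name:" and carries the leading markup along.
preg_match('~Name:</h5>.*?<div class="info-name">\s*.*?(?=\s*</div|\s*\|)~s', $string, $withoutK);

// With \K: everything matched before \K is dropped from the fullstring match.
preg_match('~Name:</h5>.*?<div class="info-name">\s*\K.*?(?=\s*</div|\s*\|)~s', $string, $withK);

var_export($withoutK[0]); // still begins with 'Name:</h5>'
echo "\n";
var_export($withK[0]);    // 'Josh <a href="#tagclan">Taguibao</a>'
echo "\n";
echo strip_tags($withK[0]); // Josh Taguibao
```

In both cases the lookahead keeps the trailing newline and </div out of the match, so only the inner markup is left for strip_tags() to clean up.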

Now, DOMDocument is not my strong suit, but I put together this snippet, which works on your sample text. (DOMDocument Demo)

$html='<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';

$dom=new DOMDocument; 
$dom->loadHTML($html); 
$name=$dom->getElementsByTagName('div')->item(0)->nodeValue; // or ->textContent
echo trim($name);
// same output as regex method

nodeValue and textContent are effectively the same (for this case anyhow) in that they both return the tag-free text from the div element.

The manual describes textContent as: "The text content of this node and its descendants."
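As a quick check of that claim, the following sketch compares the two properties on the sample div. The $div variable is introduced here only for readability.

```php
<?php
$html = '<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// Grab the first div in the document, as in the snippet above.
$div = $dom->getElementsByTagName('div')->item(0);

// For an element node, both properties return the concatenated text of its descendants.
var_dump($div->nodeValue === $div->textContent);

// The surrounding newlines are part of the text, hence the trim().
echo trim($div->textContent);
```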


Or if you need to isolate the first occurring element which has the class info-name, then you can use XPath: (Demo)

$dom = new DOMDocument();
$dom->loadHTML($html);
var_export(
    trim(
        (new DOMXPath($dom))
        ->query('//*[@class="info-name"]')
        ->item(0)
        ->nodeValue
    )
);
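One caveat with the query above: @class="info-name" only matches when the class attribute is exactly that string. If the div carried additional classes (an assumption; the sample in the question does not), a common XPath 1.0 idiom is to test for the class as a whitespace-delimited token. A minimal sketch, with made-up markup and variable names:

```php
<?php
// Hypothetical markup: the div has a second class, unlike the question's sample.
$html = '<h5>Name:</h5>
<div class="info-name highlighted">
Josh <a href="#tagclan">Taguibao</a>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact attribute comparison: finds nothing because the attribute value is "info-name highlighted".
$exact = $xpath->query('//*[@class="info-name"]');

// Token test: pad the normalized class list with spaces and look for " info-name ".
$token = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " info-name ")]');

var_export($exact->length);
echo "\n";
$name = trim($token->item(0)->nodeValue);
var_export($name);
```

Padding with spaces avoids false positives on longer class names such as info-name-small, which a plain contains(@class, "info-name") would also match.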

Upvotes: 0

linden2015

Reputation: 887

HTML is structured data. This means there are tools available to parse it. Regex is not the tool for this job.

http://php.net/manual/en/book.dom.php

Upvotes: 1
