Reputation: 13
Here's my regex:
Name:<\/h5>.*?<div class="info-name">(.*?)(<a|<\/div|\|)
Here's the content:
<h5>Name:</h5>
<div class="info-name">
Josh Taguibao
</div><a class="t0 profile" >Click to view Profile</a>
With this, I get the output I expect, which is:
Josh Taguibao
However, if the content changes to something like this:
<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>
I only get Josh instead of the whole name.
What should I add to my pattern so that it captures the full name?
Upvotes: 0
Views: 64
Reputation: 47894
If you don't want to use an HTML parser (which the SO community strongly urges at every chance), you can just match the text and strip the tags:
Code:
$string='<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';
echo preg_match('~Name:</h5>.*?<div class="info-name">\s*\K.*?(?=\s*</div|\s*\|)~s',$string,$out)?strip_tags($out[0]):'fail';
Output:
Josh Taguibao
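To check that the same pattern copes with both versions of the markup from the question, here is a minimal sketch that wraps it in a function. The function name extractName and the variable names are hypothetical, chosen only for illustration; the pattern itself is the one shown above.

```php
<?php
// Hypothetical helper (not part of the original answer): wraps the pattern above.
function extractName(string $html): ?string
{
    $pattern = '~Name:</h5>.*?<div class="info-name">\s*\K.*?(?=\s*</div|\s*\|)~s';
    if (!preg_match($pattern, $html, $out)) {
        return null; // the name block was not found
    }
    // $out[0] may still contain tags such as <a ...>, so strip them.
    return strip_tags($out[0]);
}

// First sample from the question: plain text inside the div.
$plain = '<h5>Name:</h5>
<div class="info-name">
Josh Taguibao
</div><a class="t0 profile" >Click to view Profile</a>';

// Second sample from the question: part of the name is wrapped in a link.
$linked = '<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';

var_dump(extractName($plain));
var_dump(extractName($linked));
```

Returning null on a failed match, rather than the string 'fail', is only a design choice for this sketch; it lets the caller tell "no match" apart from a real name.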
Notes:
~ is used as the pattern delimiter so that the / characters in the pattern don't need to be escaped.
\K in the pattern means: "start the fullstring match from here".
(?=...) is a positive lookahead, which is used to halt the fullstring match before matching a newline followed by </div or | (normally I would write (?=\s(?:</div>|\|)) but it was actually fewer steps the verbose way).
The s modifier/flag at the end of the pattern permits the . (dots) to additionally match new lines.
Now, DOMDocument is not my strong suit, but I slapped together this snippet that will work on your sample text:
$html='<h5>Name:</h5>
<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div><a class="t0 profile" >Click to view Profile</a>';
$dom=new DOMDocument;
$dom->loadHTML($html);
$name=$dom->getElementsByTagName('div')->item(0)->nodeValue; // or ->textContent
echo trim($name);
// same output as regex method
nodeValue and textContent are effectively the same (for this case anyhow) in that they both return the tag-free text from the div element.
The manual describes textContent as: "The text content of this node and its descendants."
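As a quick check of that claim, the following sketch reads both properties from the same div and compares them. The variable names are made up for illustration, and the markup is a trimmed copy of the sample from the question.

```php
<?php
// Trimmed copy of the sample markup from the question.
$html = '<div class="info-name">
Josh <a href="#tagclan">Taguibao</a>
</div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementsByTagName('div')->item(0);

// For an element node, both properties give the text of the node and its descendants.
$viaNodeValue   = trim($div->nodeValue);
$viaTextContent = trim($div->textContent);

var_dump($viaNodeValue);
var_dump($viaNodeValue === $viaTextContent);
```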
Or, if you need to isolate the first occurring element which has the class info-name, then you can use XPath:
$dom = new DOMDocument();
$dom->loadHTML($html);
var_export(
    trim(
        (new DOMXPath($dom))
            ->query('//*[@class="info-name"]')
            ->item(0)
            ->nodeValue
    )
);
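One caveat with @class="info-name" is that it compares the whole attribute value, so it would miss an element that carries additional classes. The sketch below assumes, purely for illustration, that the div also has a second class named highlight (this is not in the original markup) and uses the common XPath 1.0 idiom of matching a space-padded class token instead.

```php
<?php
// Assumed markup: the extra "highlight" class is hypothetical, added for this example.
$html = '<h5>Name:</h5>
<div class="info-name highlight">
Josh <a href="#tagclan">Taguibao</a>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: no match here, because @class is "info-name highlight".
$exact = $xpath->query('//*[@class="info-name"]');

// Token match: pad the normalized class list with spaces, then look for " info-name ".
$token = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " info-name ")]'
);

var_dump($exact->length);
var_dump(trim($token->item(0)->nodeValue));
```

If the div only ever has the single class, as in the question, the simpler exact comparison is enough.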
Upvotes: 0
Reputation: 887
HTML is structured data. This means there are tools available to parse it. Regex is not the tool for this job.
http://php.net/manual/en/book.dom.php
Upvotes: 1