Reputation: 569
I am trying to do a simple extraction, but I keep ending up with unpredictable results.
I have this HTML code
<div class="thread" style="margin-bottom:25px;">
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
I am trying to extract the outertext of "msgbody", but only when the matching "profile" equals a given name, like so:
$contents = $html->find('.msgbody');
$elements = $html->find('.profile');
$length = sizeof($contents);
while($x != sizeof($elements)) {
$var = $elements[$x]->outertext;
//If profile = the right name
if ($var = $name) {
$text = $contents[$x]->outertext;
echo $text;
}
$x++;
}
I get text from the wrong profiles, not the ones with the associations I need. Is there a way to just pull the desired info with one line of code?
Like if span-profile = "correct name" then pull its div-msgbody
Upvotes: 0
Reputation: 66
This is a working example using Simple HTML DOM.
I changed your example HTML so that there is more than one profile for Suzy Creamcheese, as follows (file: test_class_class.htm):
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
<div class="message">
<span class="profile">Suzy Yogurt</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">No Creamcheese</div>
This is not Suzy Creamcheese <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
A reply from Suzy Creamcheese.
</div>
</div>
</div>
</div>
Here is my test using Simple HTML DOM:
include('simple_html_dom.php');
function getMessage_for_profile($iUrl,$iProfile)
{
    // create HTML DOM
    $html = file_get_html($iUrl);
    // get text elements
    $aoProfile = $html->find('span[class=profile]');
    echo "Found ".count($aoProfile)." profiles.<br />";
    foreach ($aoProfile as $key=>$oProfile)
    {
        if ($oProfile->plaintext == $iProfile)
        {
            echo "<b>Profile ".$key.": ".$oProfile->plaintext."</b><br />";
            // Using $e->next_sibling ()
            $oCurrent = $oProfile;
            while ($oNext = $oCurrent->next_sibling())
            {
                if ( $oNext->class == "msgbody" )
                {
                    echo "<hr />";
                    echo $oNext->outertext;
                    echo "<hr />";
                }
                $oCurrent = $oNext;
            }
        }
    }
    // clean up memory
    $html->clear();
    unset($html);
    return;
}
// --------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');
getMessage_for_profile('test_class_class.htm','Suzy Creamcheese');
echo "<br /><br /><br />";
getMessage_for_profile('test_class_class.htm','Suzy Yogurt');
My output was:
Found 4 profiles.
Profile 0: Suzy Creamcheese
--------------------------------
New digs
Hello thank you for trying our soap.
Jim.
---------------------------------
Profile 3: Suzy Creamcheese
---------------------------------
A reply from Suzy Creamcheese.
---------------------------------
Found 4 profiles.
Profile 2: Suzy Yogurt
---------------------------------
No Creamcheese
This is not Suzy Creamcheese
Jim.
---------------------------------
So it can be done with Simple HTML DOM, and since I already know how the DOM works (or enough to get in trouble), I did not have to learn any new syntax!
Upvotes: 0
Reputation: 70497
Okay, I'm going to go with DOMXPath on this one. I'm not sure what 'outer text' is supposed to mean, but I'll go with this requirement:
Like if span-profile = "correct name" then pull its div-msgbody
First off, here's the minified HTML test case I used:
<html>
<body>
<div class="thread" style="margin-bottom:25px;">
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
</body>
</html>
So, we'll make an XPath query for this. Let's show the whole thing, then break it down:
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
The breakdown:
//span
Give me spans.
//span[@class='profile']
Give me spans where the class is profile.
//span[@class='profile' and contains(.,'$profile_name')]
Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after.
//span[@class='profile' and contains(.,'$profile_name')]/../
Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after; now go up a level, which gets us to <div class="message">.
//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']
Give me spans where the class is profile and the inside of the span contains $profile_name, which is the name you're after; now go up a level, which gets us to <div class="message">; and finally, give me all divs under <div class="message"> where the class is msgbody.
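As a rough check of the breakdown, the sketch below runs each step against a trimmed copy of the test markup and counts the matching nodes. The `$steps` and `$counts` names are illustrative assumptions, the markup is inlined with `loadHTML` rather than read from `test.html`, and the fourth step is written as `/..` because an expression ending in `/../` is not a complete XPath on its own.

```php
<?php
// Trimmed copy of the test case, inlined so the sketch is self-contained.
// The umlaut is written as an entity to sidestep input-encoding questions.
$html = <<<'HTML'
<html><body>
<div class="thread">
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody"><div class="subject">New digs</div>Hello thank you for trying our soap.</div>
</div>
<div class="message reply">
<span class="profile">Lars J&ouml;rgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">I never sold you any soap.</div>
</div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$profile_name = 'Suzy Creamcheese';

// One entry per step of the breakdown, from broadest to narrowest.
$steps = [
    "//span",
    "//span[@class='profile']",
    "//span[@class='profile' and contains(.,'$profile_name')]",
    "//span[@class='profile' and contains(.,'$profile_name')]/..",
    "//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']",
];

$counts = [];
foreach ($steps as $step) {
    $nodes = $xpath->query($step);
    $counts[] = $nodes->length;
    echo $nodes->length . " node(s) for: " . $step . "\n";
}
```

With this markup the counts narrow from 4 spans, to 2 profile spans, to the single span for the requested name, then its one parent and that parent's one msgbody div.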
Now then, here's a sample of the PHP code:
$doc = new DOMDocument();
$doc->loadHTMLFile("test.html");
$xpath = new DOMXPath($doc);
$profile_name = 'Lars Jörgenmeier';
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
foreach ($messages as $message) {
    echo trim("{$message->nodeValue}") . "\n";
}
XPath is very powerful like this. I recommend looking over a basic tutorial, then you can check the XPath standard if you want to see more advanced usage.
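Two caveats with the query above may be worth handling: `contains()` also matches partial names (a search for "Suzy" would match every Suzy), and a name containing an apostrophe breaks the quoted string. The following is a minimal sketch of one way around both, assuming an exact match on the profile text is what you want. The helpers `xpath_literal()` and `messages_for_profile()` and the "Pat O'Neil" profile are hypothetical, introduced only for illustration.

```php
<?php
// Hypothetical helper: turn any PHP string into an XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: join the pieces with concat().
    return "concat('" . implode("', \"'\", '", explode("'", $value)) . "')";
}

// Hypothetical helper: msgbody text for every profile span whose text equals $name.
function messages_for_profile(DOMXPath $xpath, string $name): array
{
    $query = "//span[@class='profile' and normalize-space(.)=" . xpath_literal($name) . "]"
           . "/../div[@class='msgbody']";
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse whitespace runs so nested markup reads as one line.
        $texts[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    return $texts;
}

// Inline sample markup (an assumption; the original reads test.html instead).
$html = <<<'HTML'
<html><body><div class="thread">
<div class="message"><span class="profile">Suzy Creamcheese</span><div class="msgbody">Hello thank you for trying our soap.</div></div>
<div class="message"><span class="profile">Suzy Yogurt</span><div class="msgbody">This is not Suzy Creamcheese</div></div>
<div class="message"><span class="profile">Pat O'Neil</span><div class="msgbody">Apostrophes are fine.</div></div>
</div></body></html>
HTML;

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

print_r(messages_for_profile($xpath, 'Suzy Creamcheese'));
print_r(messages_for_profile($xpath, "Pat O'Neil"));
print_r(messages_for_profile($xpath, 'Suzy')); // exact match, so nothing is returned
```

Using `normalize-space(.)` also ignores stray whitespace around the name inside the span. If partial matching is what you need, keeping `contains()` and only swapping in the quoting helper should work the same way.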
Upvotes: 3