Reputation: 1342
I am working on a simple web crawler. Below is the simple HTML I am using to learn.
input.php
<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul>
            <li>
                <a href="mail.gmail.com">Gmail</a>
            </li>
        </ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul>
            <li>
                <a href="mail.yahoo.com">Yahoo Mail</a>
            </li>
        </ul>
    </li>
</ul>
I need to crawl only the first anchor tag of each top-level li in ul[id=nav]. The code I use to crawl input.php is:
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL) {
    foreach ($navUL->find('li') as $navUL_LI) {
        echo $navUL_LI->find('a', 0)->outertext . "<br>";
    }
}
?>
It displays every anchor tag in my input.php, because find('li') also matches the nested li elements. I need to display only Google and Yahoo. How can I achieve this?
Upvotes: 1
Views: 2440
Reputation: 126
<?php
$in = '<style> .catalog-product-view .product.attribute.overview ul { margin-top: 10px; } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';
// Minimal tag tokenizer: calls $callback($tag, $data, $stack, $position, $isOpen) for every tag found.
function parseTags($input, $callback) {
    $len = strlen($input);
    $stack = [];
    $tag = "";
    $data = "";
    $isTag = false;
    $isString = false; // holds the active quote character while inside a quoted attribute value
    for ($i = 0; $i < $len; $i++) {
        $char = $input[$i];
        if ($isTag && $isString !== false) {
            // inside a quoted attribute value: copy verbatim, including any < or >
            $tag .= $char;
            if ($char == $isString && $input[$i-1] != '\\') {
                $isString = false;
            }
        } else if ($char == '<') {
            $isTag = true;
            $tag .= $char;
        } else if ($char == '>') {
            $tag .= $char;
            if (substr($tag, 0, 2) == '</') {
                // limit 2 so that [0] is the tag name only, without attributes
                $close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
                $open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));
                if ($open == $close) {
                    $callback($tag, $data, $stack, $i, false);
                    array_pop($stack);
                }
            } else if (substr($tag, -2) == '/>') {
                // self-closing tag: report it, but do not push it on the stack
                $callback($tag, $data, $stack, $i, false);
            } else {
                $callback($tag, $data, $stack, $i, true);
                $stack[] = $tag;
            }
            $tag = "";
            $data = "";
            $isTag = false;
        } else if ($isTag) {
            if ($char == '"' || $char == "'") {
                $isString = $char; // an attribute value starts here
            }
            $tag .= $char;
        } else {
            $data .= $char;
        }
    }
}
parseTags($in, function($tag, $data, $stack, $position, $isOpen) {
    print_r(func_get_args());
});
Upvotes: 0
Reputation: 959
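You can achieve that with a negative index, which counts from the end of the matches. Note that this relies on every top-level li containing exactly two anchors, as in your sample:
<?php
foreach ($html->find('ul[id=nav]') as $navUL) {
    foreach ($navUL->find('li') as $navUL_LI) {
        // -2 is the second-to-last anchor; a nested li holds only one anchor, so find() returns null there
        $a = $navUL_LI->find('a', -2);
        if ($a !== null) {
            echo $a->outertext . "<br>";
        }
    }
}
?>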
Upvotes: 0
Reputation: 41903
In this case you can target them directly with the children() method. Example:
foreach ($html->find('ul#nav') as $ul) {
    // children() returns direct children only, i.e. the top-level li elements
    foreach ($ul->children() as $li) {
        echo $li->children(0)->outertext . '<br/>';
    }
}
Alternatively, you can use DOMDocument + DOMXPath for this too:
// $str holds the HTML markup as a string
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');
foreach ($links as $a) {
    echo $a->nodeValue . '<br/>';
}
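To try the XPath variant on its own, here is a minimal sketch. It assumes the markup from the question is held in a string; the contents of $str below are filled in for illustration only. The single slashes in /li/a select direct children, which is why the nested Gmail and Yahoo Mail links are skipped, while a double slash would match anchors at any depth.

```php
<?php
// Sample markup copied from the question; $str is an assumption standing in for your real page source.
$str = '<ul id="nav">
<li><a href="www.google.com">Google</a>
<ul><li><a href="mail.gmail.com">Gmail</a></li></ul>
</li>
<li><a href="www.yahoo.com">Yahoo</a>
<ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul>
</li>
</ul>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// "/" steps match direct children only, so anchors inside the nested <ul> are not selected
$top = [];
foreach ($xpath->query('//ul[@id="nav"]/li/a') as $a) {
    $top[] = $a->nodeValue;
}

// "//" matches at any depth, so this counts every anchor under #nav
$all = $xpath->query('//ul[@id="nav"]//a')->length;

echo implode(', ', $top), "\n"; // Google, Yahoo
echo $all, "\n";                // 4
```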
Upvotes: 1
Reputation: 9530
Try this:
// get the children of the element with id "nav", i.e. the top-level lis
// (getElementById() takes the bare id, without a leading #)
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
    $a = $li->find('a', 0);
    // retrieve the link text itself
    echo "link text: " . $a->innertext() . "\n";
}
See the simple-html-dom manual for details of all these methods.
Upvotes: 0
Reputation: 750
I have done the same thing in Objective-C.
You can use an XML or HTML API to parse your HTML into an object tree.
If you want to do this by hand, find each opening tag and its matching closing tag.
After that, take the first child, then the second, and so on.
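In PHP, the "walk the children yourself" idea could look roughly like the sketch below. It uses the built-in DOMDocument rather than simple_html_dom, and firstLevelLinks() is a hypothetical helper name made up for this example, not part of any library.

```php
<?php
// Hypothetical helper: for the <ul> with the given id, returns href => text
// for the first <a> that is a direct child of each top-level <li>.
function firstLevelLinks(string $html, string $listId): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $links = [];
    foreach ($dom->getElementsByTagName('ul') as $ul) {
        if ($ul->getAttribute('id') !== $listId) {
            continue; // not the list we are looking for
        }
        foreach ($ul->childNodes as $li) {
            // childNodes can also contain whitespace text nodes, so keep only <li> elements
            if (!($li instanceof DOMElement) || $li->nodeName !== 'li') {
                continue;
            }
            foreach ($li->childNodes as $child) {
                if ($child instanceof DOMElement && $child->nodeName === 'a') {
                    $links[$child->getAttribute('href')] = trim($child->textContent);
                    break; // first anchor only
                }
            }
        }
    }
    return $links;
}

// Markup from the question, condensed onto one line
$html = '<ul id="nav"><li><a href="www.google.com">Google</a><ul><li><a href="mail.gmail.com">Gmail</a></li></ul></li>'
      . '<li><a href="www.yahoo.com">Yahoo</a><ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul></li></ul>';

print_r(firstLevelLinks($html, 'nav'));
```

Because only direct children are inspected at each level, the anchors inside the nested lists are never reached.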
Upvotes: 0
Reputation: 1662
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL) {
    foreach ($navUL->find('li') as $navUL_LI) {
        $a = $navUL_LI->find('a', 0);
        // match on the href of the first anchor; strpos() can return 0, so compare against false
        if (strpos($a->href, 'www.google') !== false || strpos($a->href, 'www.yahoo') !== false) {
            echo $a->outertext . "<br>";
        }
    }
}
?>
Upvotes: 1