harishk

Reputation: 428

Selective extraction of data from external site using DOM PHP web crawler

I have a PHP DOM web crawler that works fine. It extracts the specified tags, along with their links, from an external forum site and displays them on my page.

Recently, though, I ran into a problem.

This is the HTML of the forum data:

<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>

Now suppose the table above is the only content on that site, and I try to extract it with a web crawler like this:

<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

It extracts all the data inside the td elements that have the class name "FootNotes2".

Now, what if I want to extract specific data from within a tag, for example only the names "dreamer1984" and "monariyadh" from the first matching td of each row?

And what if I want to extract data only from the 3rd matching td (skipping the rest), given that they all share the same class name?

Please note that I could use a regex like this:

preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs);

foreach ($matchs['name'] as $k => $v){
    var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]);
}

But I would prefer a solution that uses the DOM parser. Any help is appreciated.

Upvotes: 4

Views: 154

Answers (3)

pguardiario

Reputation: 55002

You have to use a regex either way, so there is no sense overcomplicating it:

foreach($html->find('tr') as $tr) {
  echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n";
  echo $tr->find('td',3)->text() . "\n";
}
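
For anyone who wants to try the same row-by-row idea without installing simple_html_dom, here is a rough sketch using PHP's built-in DOMDocument instead (a swapped-in parser, not the library used above). It assumes the table layout from the question, where the name sits in the second td and the date in the fourth. The function name extractRows and the $sample string are made up for illustration.

```php
<?php
// Sample rows copied (and trimmed) from the question; a real crawler would fetch the page.
$sample = <<<HTML
<table><tbody>
<tr>
    <td>&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837880.php" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td>&nbsp;</td>
    <td class="FootNotes2" align="center">02/28/17 01:42</td>
</tr>
<tr>
    <td>&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837879.php" class="Links2">nbme</a> - monariyadh</td>
    <td>&nbsp;</td>
    <td class="FootNotes2" align="center">02/27/17 23:12</td>
</tr>
</tbody></table>
HTML;

// extractRows() is a hypothetical helper name, not part of any library.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 4) {
            continue; // not a data row in the assumed layout
        }
        $rows[] = [
            // Drop everything up to the last " - ", leaving only the user name.
            'name' => trim(preg_replace('/^.* - /s', '', $cells->item(1)->textContent)),
            'date' => trim($cells->item(3)->textContent),
        ];
    }
    return $rows;
}

foreach (extractRows($sample) as $row) {
    echo $row['name'], ' | ', $row['date'], "\n";
}
```

The cell indexes (1 and 3) are tied to this particular table, so they would need adjusting if the forum changes its column layout.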

I really don't like apokryfos' approach to this, it's a lot of confusion with no benefit.

Upvotes: 0

B. Desai

Reputation: 16446

If you want to extract only the text (not the child tags and their contents):

foreach ($html->find("td.FootNotes2") as $element) {

    $children = $element->children; // get an array of children
    foreach ($children AS $child) {
      $child->outertext = ''; // Blank out the child element so only the td's own text is left
    }
    echo $element->innertext."<br>";
}

Output:

- dreamer1984
02/28/17 01:42
0
200
- monariyadh
02/27/17 23:12
0
108

Upvotes: 0

apokryfos

Reputation: 40690

As I said in my comment, some text processing is unavoidable. However, you can get the text node associated with the td like so:

require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find("tr") as $row) {
    $element = $row->find('td.FootNotes2', 0);
    if ($element == null) { continue; }
    $textNode = array_filter($element->nodes, function ($n) {
        return $n->nodetype == 3; // 3 = text node (HDOM_TYPE_TEXT in simple_html_dom)
    });

    if (!empty($textNode)) {
        $text = current($textNode);
        echo $text;
    }
}

This echoes:

- dreamer1984
- monariyadh

Do with that what you will.

Updated to only find the first td for each tr.
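
The same "take only the text node" idea can also be sketched with PHP's built-in DOMDocument and DOMXPath, in case you would rather not depend on simple_html_dom. This is only an illustration under a few assumptions: the class attribute is exactly "FootNotes2", the markup matches the question, and the names ownTextOfFirstCells and $sample are invented for the example.

```php
<?php
// Sample rows copied (and trimmed) from the question.
$sample = <<<HTML
<table><tbody>
<tr>
    <td>&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837880.php" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td>&nbsp;</td>
    <td class="FootNotes2" align="center">02/28/17 01:42</td>
</tr>
<tr>
    <td>&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837879.php" class="Links2">nbme</a> - monariyadh</td>
    <td>&nbsp;</td>
    <td class="FootNotes2" align="center">02/27/17 23:12</td>
</tr>
</tbody></table>
HTML;

// ownTextOfFirstCells() is a hypothetical helper name, not a library function.
function ownTextOfFirstCells(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    // Per row: the first td with class FootNotes2, then only its direct text nodes,
    // which skips the text inside the <a> element.
    foreach ($xpath->query('//tr/td[@class="FootNotes2"][1]/text()') as $textNode) {
        // Strip the leading " - " separator that precedes the user name.
        $text = trim(preg_replace('/^\s*-\s*/', '', $textNode->nodeValue));
        if ($text !== '') {
            $names[] = $text;
        }
    }
    return $names;
}

print_r(ownTextOfFirstCells($sample));
```

Changing the `[1]` in the XPath expression to `[2]` should select the date cell instead, since the predicate counts matching td elements within each row.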

Upvotes: 2
