Reputation: 428

Selective extraction of data from external site using DOM PHP web crawler

I have this PHP dom web crawler which works fine. it extracts mentioned tag along with its link from a (external) forum site to my page.

But recently i ran into a problem. Like

this is the HTML of the forum data::

<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>

Now if we consider the above code (table data) as the only statements available in that site. and if i tried to extract it with a web crawler like,

<?php
    require_once('dom/simple_html_dom.php'); 
    $html = file_get_html('http://www.sitename.com/');
    foreach($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

It extracts al the data that is inside with a class name as "FootNote2"

Now what if i want to extract specific data in tag, for example names like, " dreamer1984" and "monariyadh" from the first tag/line.

and what if i wanted to extract data from 3rd (skipping the rest) which has same class names.

Please note that i can use "regex" like

preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs);

foreach ($matchs['name'] as $k => $v){
    var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]);
}

But i prefer to find solution for this in DOM parser... Any help is appreciated..

Upvotes: 4

Selective extraction of data from external site using DOM PHP web crawler

Answers (3)

Related Questions