Donal Rafferty

Reputation: 19826

Beginner PHP scraping help - getting img src?

I am currently trying to improve my knowledge of PHP, and I have set myself the task of scraping a website and converting the data I retrieve into JSON.

Here is an example row of the data I am trying to parse:

 <tr>
 <td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
 </td>
 <td >
      Copenhagen
 </td>
 <td>
      Sas
 </td>
 <td>
     SK537
 </td>
 <td>
     02 Apr 10:20
 </td>
 <td class="last">
     Delayed 11:30
 </td>
 </tr>

And here is my PHP code so far:

$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content,'<table width="100%" cellspacing="0" cellpadding="0" border="0" summary="Departure times detail information"');

$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);

preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

    if ((strpos($row,'<th')===false)){

        preg_match_all("|<td(.*)</td>|U",$row,$cells);

        $url_src = strip_tags($cells[0][0]);

        $airport = strip_tags($cells[0][1]);

        $airline = strip_tags($cells[0][2]);

        $flightnum = strip_tags($cells[0][3]);

        $schedule = strip_tags($cells[0][4]);

        $status = strip_tags($cells[0][5]);

        echo "{$url_src} - {$airport} - {$airline} - {$flightnum} - {$schedule} - {$status}<br>\n";

    }

}

I can currently get nearly all values correctly except I cannot seem to get anything for the cell that contains this:

<td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
</td>

Can anyone help me work out how to get the img string? I would be happy just to get the entire string within the <td></td>, like this:

<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />

But if it's possible to parse out just the src string, that would be very helpful.

Upvotes: 0

Views: 222

Answers (1)

Omega

Reputation: 1129

Your <img> tag is not opening at all; that's why your regular expression won't parse it.

Try:

<td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
</td>
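
Separately, it may be worth checking the strip_tags() call on the first cell. That cell contains nothing but the <img> element, and strip_tags() removes the element itself, so an empty result there would be expected. Below is a minimal sketch of two ways around that, assuming the cell looks exactly like the sample in the question. The $cell variable and the regular expression are illustrative, not taken from the original script, and a regex like this is fragile on markup that differs from the sample.

```php
<?php
// Sample cell copied from the question; in the real script this would be $cells[0][0].
$cell = '<td class="first"><img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" /></td>';

// Plain strip_tags() removes every tag, including <img>, so no text is left.
$stripped = trim(strip_tags($cell));

// Passing '<img>' as the allowed-tags argument keeps the whole <img ... /> string.
$img = trim(strip_tags($cell, '<img>'));

// Capture only the value of the src attribute (assumes double-quoted attributes).
$src = preg_match('/<img[^>]*\ssrc="([^"]*)"/i', $cell, $m) ? $m[1] : null;

echo $img, "\n";
echo $src, "\n";
```

In the loop from the question, this would mean replacing the strip_tags() call for $url_src with one of the two approaches above while leaving the other cells unchanged. For anything beyond a quick experiment, a DOM parser is usually a sturdier choice than regular expressions.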

Upvotes: 1
