Reputation: 19826
I am trying to improve my PHP, so I have set myself the task of scraping a website and converting the data I retrieve to JSON.
Here is an example row of the data I am trying to parse:
<tr>
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
<td >
Copenhagen
</td>
<td>
Sas
</td>
<td>
SK537
</td>
<td>
02 Apr 10:20
</td>
<td class="last">
Delayed 11:30
</td>
</tr>
And here is my PHP code so far:
$raw = file_get_contents($url);
// Strip tabs, newlines, double spaces and other whitespace debris
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
// Cut out the departures table
$start = strpos($content,'<table width="100%" cellspacing="0" cellpadding="0" border="0" summary="Departure times detail information"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    // Skip the header row (the one containing <th> cells)
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $url_src = strip_tags($cells[0][0]);
        $airport = strip_tags($cells[0][1]);
        $airline = strip_tags($cells[0][2]);
        $flightnum = strip_tags($cells[0][3]);
        $schedule = strip_tags($cells[0][4]);
        $status = strip_tags($cells[0][5]);
        echo "{$url_src} - {$airport} - {$airline} - {$flightnum} - {$schedule} - {$status}<br>\n";
    }
}
I get nearly all of the values correctly, but I cannot get anything for the cell that contains this:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
Can anyone tell me what I need to do to get the img string? I would be happy just to get the entire string inside the <td></td>, like this:
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
But if it is possible to parse out just the src value, that would be very helpful.
Upvotes: 0
Views: 222
Reputation: 1129
Your <img> tag is never opened, which is why your regular expression does not parse it.
Try:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
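Also note that strip_tags() removes the <img> element itself, so a cell that contains nothing but an image comes back as an empty string once it has been through strip_tags(). Here is a minimal sketch of one way around that, assuming each row looks like the sample in the question. The helper name img_src and the array keys are made up for illustration, and the row is hard-coded so the snippet runs on its own:

```php
<?php
// Sample row copied from the question; in the real script this would be
// one entry of $rows[0] after the whitespace clean-up.
$row = '<tr><td class="first"><img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" /></td>'
     . '<td >Copenhagen</td><td>Sas</td><td>SK537</td><td>02 Apr 10:20</td>'
     . '<td class="last">Delayed 11:30</td></tr>';

// Capture only the inner HTML of each cell in group 1.
// U = ungreedy, s = let "." match newlines in case any are left.
preg_match_all('|<td[^>]*>(.*)</td>|Us', $row, $cells);

// Hypothetical helper: returns the src attribute of the first <img>
// found in $html, or null when there is none.
function img_src($html)
{
    if (preg_match('/<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1/i', $html, $m)) {
        return $m[2];
    }
    return null;
}

$flight = array(
    'image'     => img_src($cells[1][0]),           // no strip_tags() on this cell
    'airport'   => trim(strip_tags($cells[1][1])),
    'airline'   => trim(strip_tags($cells[1][2])),
    'flightnum' => trim(strip_tags($cells[1][3])),
    'schedule'  => trim(strip_tags($cells[1][4])),
    'status'    => trim(strip_tags($cells[1][5])),
);

// One JSON object per row; collect these in an array to encode the whole table.
echo json_encode($flight), "\n";
```

If you want the whole <img ... /> string instead of just the src, $cells[1][0] already holds it. For anything more involved than this one table, DOMDocument is usually less fragile than regular expressions.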
Upvotes: 1