Reputation: 3281
I have a PHP script for HTML parsing and it works on simple websites, but now I need to parse the cinema programme from this website. I am using the file_get_contents
function, which returns just 4 newline characters \n,
and I can't figure out why.
The website itself will be harder to parse with DOMDocument and XPath because the programme is shown in a pop-up window and the URL address doesn't seem to change, but I will deal with that problem after I manage to retrieve the HTML code of the site.
Here is the shortened version of my script:
<?php
$url = "http://www.cinemacity.cz/";
$content = file_get_contents($url);

$dom = new DOMDocument();
// loadHTML() returns FALSE on failure; the $dom object itself is never FALSE
if ($dom->loadHTML($content) === FALSE) {
    echo "FAAAAIL\n";
}

$xpath = new DOMXPath($dom);
$tags = $xpath->query("/html");
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
?>
EDIT:
So, following the advice from WBAR (thank you), I was looking for a way to change the header in the file_get_contents() function, and this is the answer I found elsewhere. Now I am able to obtain the HTML of the site; hopefully I will manage to parse this mess :D
<?php
// Suppress warnings caused by malformed HTML in the target page
libxml_use_internal_errors(true);

// Create a stream context that sends a non-empty User-Agent
$opts = array(
    'http' => array(
        'user_agent' => 'PHP libxml agent', //Wget 1.13.4
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$content = file_get_contents('http://www.cinemacity.cz/', false, $context);

$dom = new DOMDocument();
// loadHTML() returns FALSE on failure; the $dom object itself is never FALSE
if ($dom->loadHTML($content) === FALSE) {
    echo "FAAAAIL\n";
}

$xpath = new DOMXPath($dom);
$tags = $xpath->query("/html");
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
?>
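If you later need more control (timeouts, redirects, HTTPS), the same User-Agent trick works with the cURL extension. A minimal sketch of an equivalent request, not part of the original fix:

<?php
// Sketch: fetch the page with cURL, sending the same User-Agent as above
$ch = curl_init('http://www.cinemacity.cz/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);          // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'PHP libxml agent'); // the header the server checks for
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Language: en'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);          // follow any redirects
$content = curl_exec($ch);
curl_close($ch);
?>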
Upvotes: 1
Views: 871
Reputation: 4984
The problem is not in PHP but in the target host. It inspects the client's User-Agent header. Look at this:
wget http://www.cinemacity.cz/
2012-10-07 13:54:39 (1,44 MB/s) - saved `index.html.1' [234908]
but when the User-Agent header is removed:
wget --user-agent="" http://www.cinemacity.cz/
2012-10-07 13:55:41 (262 KB/s) - saved `index.html.2' [4/4]
Only 4 bytes were returned by the server.
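In PHP the quickest workaround, assuming the server only checks that some User-Agent is present, is to set one globally before calling file_get_contents() (the exact string here is just an example):

<?php
// Set a User-Agent for all subsequent file_get_contents() HTTP requests
ini_set('user_agent', 'Wget/1.13.4');
$content = file_get_contents('http://www.cinemacity.cz/');
var_dump(strlen($content)); // should now be far more than 4 bytes
?>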
Upvotes: 4
Reputation: 38
Try to get the contents this way:
function get2url($url, $timeout = 30, $port = 80, $buffer = 128) {
    $arr = parse_url($url);
    if ($arr === false || !isset($arr['host'])) return "URL ERROR";

    // Use an SSL transport (and port 443) for https URLs
    $ssl = "";
    if (isset($arr['scheme']) && $arr['scheme'] == "https") {
        $ssl = "ssl://";
        if ($port == 80) $port = 443;
    }

    // parse_url() omits 'path' and 'query' when they are absent
    $path = isset($arr['path']) ? $arr['path'] : "/";
    if (isset($arr['query'])) $path .= "?" . $arr['query'];

    // Build a minimal HTTP/1.0 request (a User-Agent header can be added here too)
    $header  = "GET " . $path . " HTTP/1.0\r\n";
    $header .= "Host: " . $arr['host'] . "\r\n";
    $header .= "\r\n";

    $f = @fsockopen($ssl . $arr['host'], $port, $errno, $errstr, $timeout);
    if (!$f) {
        return $errstr . " (" . $errno . ")";
    } else {
        @fputs($f, $header); // the query string is already part of the request line
        $echo = "";
        while (!feof($f)) { $echo .= @fgets($f, $buffer); }
        @fclose($f);
        return $echo; // raw response: status line + headers + body
    }
}
You will have to strip the response headers from the returned string, though.
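For example, by splitting on the blank line that separates headers from body (a sketch; get2url is the function above):

$response = get2url("http://www.cinemacity.cz/");
// The response headers end at the first blank line (\r\n\r\n)
$parts = explode("\r\n\r\n", $response, 2);
$body  = isset($parts[1]) ? $parts[1] : "";
echo $body;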
Upvotes: 0