Mark Segal

Reputation: 5550

How to download a website in PHP - I have a small problem. (or major?)


I am learning PHP, and I'm trying to make an application that has a relationship with an external website.
I need to download it.
So I got this code:

$str = file_get_contents($url);


This should return the HTML contents of a website.
It works fine for most websites, but for one in particular, http://www.fxp.co.il, it shows crap.
What is the problem? What can I do to fix it?
Thank you!

Upvotes: 1

Views: 137

Answers (1)

hakre

Reputation: 197767

Well, you should inspect the response headers, as they tell you about the encoding of the data returned by file_get_contents().

For example, if it's gzip encoded, you need to uncompress it.

Normally you won't notice this, because file_get_contents() sends its request in a way that tells the server the client does not support compression.

However some servers just do not care and send you compressed responses anyway:

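<?php

$url = 'http://www.fxp.co.il/';

// Fetch the raw response body; $http_response_header is filled in by the HTTP wrapper.
$buffer = file_get_contents($url);

// Show the response headers (note "Content-Encoding: gzip" in the output).
echo $url, '<hr>', '<pre>', implode("\n", $http_response_header), '</pre>';

// Uncompress the gzip-encoded body (gzdecode() needs PHP 5.4+ and the zlib extension).
$bare = gzdecode($buffer);

// Print the first 256 bytes of the decoded HTML.
echo '<hr>', htmlspecialchars(substr($bare, 0, 256));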

Output:

http://www.fxp.co.il/
------------------------------------------------------------
HTTP/1.1 200 OK
Server: nginx/0.7.67
Date: Mon, 29 Aug 2011 19:19:55 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Set-Cookie: bb_lastvisit=1314607056; expires=Tue, 28-Aug-2012 19:12:44 GMT; path=/
Set-Cookie: bb_lastactivity=0; expires=Tue, 28-Aug-2012 19:12:44 GMT; path=/
X-Accel-Expires: 600
Cache-control: must-revalidate, post-check=0, pre-check=0
Pragma: cache
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 14170
Expires: Tue, 24 Jan 1984 08:00:00 GMT
X-Header: Boost Citrus 1.9
Cache-Control: must-revalidate, post-check=0, pre-check=0
------------------------------------------------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" dir="rtl" lang="he"> <head> <meta http-equiv="Content-Type" content="text/html; charset
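
If the same code also has to work for servers that do not compress, it may help to decode only when the headers actually say gzip. Here is a minimal sketch of that idea. It assumes the zlib extension is loaded and that the request was not redirected (with redirects, $http_response_header holds the headers of every response in the chain). decodeBody() is a hypothetical helper name used only for illustration, not a PHP built-in.

```php
<?php

/**
 * Decode a response body only if the headers say it is gzip-encoded.
 * decodeBody() is a hypothetical helper name, not part of PHP.
 */
function decodeBody(string $body, array $headers): string
{
    foreach ($headers as $header) {
        // Header names are case-insensitive; match e.g. "Content-Encoding: gzip".
        if (preg_match('/^Content-Encoding:\s*(x-)?gzip\s*$/i', $header)) {
            $decoded = gzdecode($body);
            // gzdecode() returns false on failure; fall back to the raw body.
            return $decoded === false ? $body : $decoded;
        }
    }
    // No gzip Content-Encoding header found: return the body untouched.
    return $body;
}

// Usage with a live request (needs network access and allow_url_fopen):
// $buffer = file_get_contents('http://www.fxp.co.il/');
// echo htmlspecialchars(substr(decodeBody($buffer, $http_response_header), 0, 256));
```

The function takes the header lines as a parameter instead of reading $http_response_header itself, so it can be tried out without making a request.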

Take care!

Upvotes: 2
