Reputation: 906
I want to use sockets to open a link and read the HTML code. So far I am using this:
my $req = <<EOT
GET / ${id} HTTP/1.1
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:identity
Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Connection:${connection}
Host:${host}
User-Agent:Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36
EOT
;
$socket->send($req);
print "skipping headers\n";
while(<$socket>) { s/^(.*?)\r?\n$/\1/; last if /^\s*\r?\n?$/; }
print "Reading Chunks\n";
my $buffer = "";
while(<$socket>)
{
    last if /^HTTP/;
    next if /^.{0,5}$/;
    s/^\s*(.*?)\s*\r?\n$/\1/;
    $buffer .= $_;
}
print $buffer;
I have two problems:
1) while(<$socket>) takes a lot of time. When I put the print inside the while loop, I can see that everything except the closing </html> tag is printed right away, and then it hangs for about a minute before that last tag is added.
2) I don't get the real source code of the page, I mean the one shown by view-source:www.example.com. Am I missing something?
EDIT:
I call this sub at the beginning to create the socket:
sub _connect
{
    my ($peerAdd) = @_;
    return IO::Socket::INET->new(
        PeerAddr => $peerAdd,
        PeerPort => 'http(80)',
        Proto => 'tcp'
    )
    or die "Could not connect to $peerAdd:80!! $!"
}
Thanks in advance.
Upvotes: 0
Views: 174
Reputation: 123639
You send an HTTP/1.1 request, which is keep-alive by default, i.e. the server keeps the connection open and waits for more requests. Thus the last read will only return once the server closes the connection because of inactivity, long after the last bytes of the response were received.
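One way to avoid the wait is to ask the server to close the connection as soon as the response is complete, so that a read loop on the socket ends at EOF. The following is only a minimal sketch of that idea: build_request and fetch_raw are hypothetical helper names, not part of the code in the question, and the sketch assumes the server honors the Connection: close header.

```perl
use strict;
use warnings;

# Build a minimal HTTP/1.1 GET request that opts out of keep-alive.
# build_request is a hypothetical helper introduced for illustration.
sub build_request {
    my ($host, $path) = @_;
    return join("\r\n",
        "GET $path HTTP/1.1",
        "Host: $host",
        "Accept-Encoding: identity",
        "Connection: close",    # ask the server to close after responding
        "", "");                # empty line terminates the header block
}

# Send the request and read the whole response (headers and body)
# until the server closes the socket. Also a hypothetical helper.
sub fetch_raw {
    my ($socket, $host, $path) = @_;
    print {$socket} build_request($host, $path);
    local $/;                   # slurp mode: read until EOF
    return scalar(<$socket>);
}
```

Note that every header line ends in CRLF and the header block ends with an empty line. With a socket from _connect, a call such as fetch_raw($socket, 'www.example.com', '/') should return as soon as the server closes the connection, rather than after its idle timeout.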
If you are lazy, you should just use LWP::UserAgent or a similar module. If you instead want to do everything by hand, you have to deal with all the messy stuff yourself, e.g. chunked encoding, compressed transfers, lots of non-standard servers, etc. This is far from trivial.
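To give an idea of what "by hand" involves, here is a rough sketch of decoding a body sent with Transfer-Encoding: chunked. decode_chunked is a hypothetical helper written for illustration. It assumes the complete body is already in memory and it ignores trailers, which a real client would also have to handle.

```perl
use strict;
use warnings;

# Decode a body sent with "Transfer-Encoding: chunked".
# Each chunk is: <size in hex>[;extensions] CRLF <data> CRLF,
# and a chunk of size 0 marks the end of the body.
# decode_chunked is a hypothetical helper; trailers are ignored.
sub decode_chunked {
    my ($raw) = @_;
    my $body = '';
    my $pos  = 0;
    while (1) {
        my $eol = index($raw, "\r\n", $pos);
        die "truncated chunk header\n" if $eol < 0;
        my $line = substr($raw, $pos, $eol - $pos);
        my ($hex) = $line =~ /\A([0-9A-Fa-f]+)/
            or die "bad chunk size line: $line\n";
        my $size = hex $hex;
        $pos = $eol + 2;        # move past the size line
        last if $size == 0;     # terminating chunk
        die "truncated chunk data\n" if $pos + $size > length $raw;
        $body .= substr($raw, $pos, $size);
        $pos += $size + 2;      # skip the data and its trailing CRLF
    }
    return $body;
}

print decode_chunked("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"), "\n";
```

Compare this with the next if /^.{0,5}$/ line in the question, which appears to guess at chunk-size lines by their length and so also drops any short line of real content. That, together with the whitespace stripping, may be part of why the output differs from view-source.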
Upvotes: 5