Bharath Keshava

Reputation: 62

Retrieving HTTP URLs using Perl scripting

I'm trying to save the whole web page on my system as a .html file and then parse that file, to find some tags and use them.

I'm able to save/parse http://<url>, but not able to save/parse https://<url>. I'm using Perl.

The following code saves HTTP pages fine, but it doesn't work for HTTPS:

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Protocol::https;   # fails at compile time if HTTPS support is not installed
use HTTP::Cookies;

sub main
{
  my $ua = LWP::UserAgent->new();

  my $cookies = HTTP::Cookies->new(
    file => "cookies.txt",
    autosave => 1,
    );

  $ua->cookie_jar($cookies);

  $ua->agent("Google Chrome/30");

#$ua->ssl_opts( SSL_ca_file => 'cert.pfx' );

  # This proxy setting applies to http:// requests only.
  $ua->proxy('http','http://proxy.com');
  my $response = $ua->get('http://google.com');

#$ua->credentials($response, "", "usrname", "password");

  # Stop here on failure instead of saving an error page.
  unless($response->is_success) {
    die "Error: " . $response->status_line . "\n";
  }

  # Let's save the output.
  my $save = "save.html";

  # The UTF-8 layer avoids a 'wide character in print' warning.
  open(my $fh, '>:encoding(UTF-8)', $save)
    or die "\nCannot create save file '$save': $!\n";

  my $content = $response->decoded_content;
  print $fh $content;

  close $fh;

  print "Saved ",
      length($content),
      " characters of data to '$save'.\n";
}

main();

Is it possible to parse an HTTPS page?

Upvotes: 1

Views: 3340

Answers (2)

Dave Cross

Reputation: 69264

Always worth checking the documentation for the modules that you're using...

You're using modules from libwww-perl. That includes a cookbook. And in that cookbook, there is a section about HTTPS, which says:

URLs with https scheme are accessed in exactly the same way as with http scheme, provided that an SSL interface module for LWP has been properly installed (see the README.SSL file found in the libwww-perl distribution for more details). If no SSL interface is installed for LWP to use, then you will get "501 Protocol scheme 'https' is not supported" errors when accessing such URLs.

The README.SSL file says this:

As of libwww-perl v6.02 you need to install the LWP::Protocol::https module from its own separate distribution to enable support for https://... URLs for LWP::UserAgent.

So you just need to install LWP::Protocol::https.
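
Before installing anything, it can help to confirm which pieces are actually missing. The following is a minimal sketch, not part of libwww-perl: `module_available` is a hypothetical helper name, and the check only tests whether Perl can load each module, not whether an HTTPS request will succeed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Turn Some::Module into Some/Module.pm, the form require expects for a string.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on the two modules the HTTPS fetch depends on.
for my $module (qw(LWP::UserAgent LWP::Protocol::https)) {
    printf "%-22s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

If LWP::Protocol::https is reported as MISSING, installing it from CPAN (for example with `cpan LWP::Protocol::https`) should add the https scheme to LWP::UserAgent.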

Upvotes: 5

lsiebert

Reputation: 667

You need to have Crypt::SSLeay (https://metacpan.org/module/Crypt::SSLeay) installed for https links.

It provides SSL support for LWP.

Bit me in the ass with a project of my own.

Upvotes: 0
