Reputation: 1740
I want to extract data from several HTML pages, but I'm not familiar with HTML extraction. I have working code that reads the entire page source and then strips out the unwanted parts with regexes, but it seems quite slow.
I'm reading financial information and only want to extract a single number from each page, so ideally I wouldn't have to read the entire page each time.
This is what I have in Perl:
use strict;
use warnings;
use LWP::Simple;

my $mult;
my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';
my $content = get($url) or die "Couldn't fetch $url";
$content =~ s/\R//g;                                     # remove linebreaks
$content =~ s/.*<div class="nv_lefty" id="nv_value">//;  # remove everything before the tag
$content =~ s/<.*//g;                                    # remove everything from the next < onwards
if    ($content =~ s/billion//) { $mult = 1e9; }
elsif ($content =~ s/million//) { $mult = 1e6; }
else                            { $mult = 1;   }
$content =~ s/[^\d.-]//g;                                # keep digits, decimal points and minus signs only
$content = $content * $mult;
The get($url) call is quite slow because it downloads the entire page, whereas I'm only interested in a single number. Is there a faster way to do this? I looked into HTML::TableExtract, but I don't think the number I'm extracting is in a standard HTML table, and I'm not sure it would be any faster.
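The closest thing I can see to a partial read is LWP::UserAgent's :content_cb handler, which receives the page in chunks and can abort the transfer by dying once the wanted tag has arrived. A rough sketch (how much it saves depends on how early in the page the div appears):

use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';
my $ua  = LWP::UserAgent->new;

my $buffer = '';
$ua->get($url, ':content_cb' => sub {
    my ($chunk) = @_;
    $buffer .= $chunk;
    # dying inside the callback aborts the rest of the transfer;
    # wait until the tag and its full text content are in the buffer
    die "got it\n" if $buffer =~ /id="nv_value">[^<]*</;
});

# the buffer holds everything received up to the abort point
if ($buffer =~ /id="nv_value">\s*([^<]+)/) {
    print "raw value: $1\n";
}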
Upvotes: 0
Views: 109
Reputation: 5279
Have a look at Web::Scraper rather than using regexes. Something like this could save you a lot of time and will be less prone to errors.
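A minimal sketch, reusing the nv_value div id from your code (note that Web::Scraper still fetches the whole page, so the win is mainly robustness and maintainability rather than transfer time):

use strict;
use warnings;
use URI;
use Web::Scraper;

my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';

# match the value div by its id with a CSS selector instead of regexes
my $scraper = scraper {
    process '#nv_value', value => 'TEXT';
};

my $res = $scraper->scrape(URI->new($url));
print "raw value: $res->{value}\n";

You can then apply your existing billion/million multiplier logic to $res->{value}.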
Upvotes: 1