Reputation: 1740
I want to extract data from several HTML pages, but I'm not familiar with HTML extraction. I have working code that reads the entire page source and then strips out the unwanted parts with regexes, but it seems quite slow.
I'm reading financial information and only want to extract a single number from each page, so ideally I wouldn't have to read the entire page each time.
This is what I have in Perl:
use strict;
use warnings;
use LWP::Simple;

my $mult;
my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';
my $content = get($url) or die "Couldn't fetch $url";
$content =~ s/\R//g;                                     # remove linebreaks
$content =~ s/.*<div class="nv_lefty" id="nv_value">//;  # remove everything before the tag
$content =~ s/<.*//g;                                    # remove everything from the next < onwards
if    ($content =~ s/billion//) { $mult = 1e9; }
elsif ($content =~ s/million//) { $mult = 1e6; }
else                            { $mult = 1;   }
$content =~ s/[^\d.-]//g;                                # keep digits, decimal points and minus signs only
$content = $content * $mult;
The get($url) call is quite slow because it downloads the entire page, whereas I'm only interested in a single number. Is there a faster way to do this? I looked into HTML::TableExtract, but I don't think the number I'm extracting is in a standard HTML table, and I'm not sure it would be any faster.
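The closest thing I can see to a partial read is LWP::UserAgent's :content_cb handler, which receives the page in chunks and can abort the transfer by dying once the wanted tag has arrived. A rough sketch (how much it saves depends on how early in the page the div appears):

use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';
my $ua  = LWP::UserAgent->new;

my $buffer = '';
$ua->get($url, ':content_cb' => sub {
    my ($chunk) = @_;
    $buffer .= $chunk;
    # dying inside the callback aborts the rest of the transfer;
    # wait until the tag and its full text content are in the buffer
    die "got it\n" if $buffer =~ /id="nv_value">[^<]*</;
});

# the buffer holds everything received up to the abort point
if ($buffer =~ /id="nv_value">\s*([^<]+)/) {
    print "raw value: $1\n";
}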
Upvotes: 0
Views: 109
Reputation: 5279
Have a look at Web::Scraper rather than using regexes. Something like this could save you a lot of time and will be less prone to errors.
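A minimal sketch, reusing the nv_value div id from your code (note that Web::Scraper still fetches the whole page, so the win is mainly robustness and maintainability rather than transfer time):

use strict;
use warnings;
use URI;
use Web::Scraper;

my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';

# match the value div by its id with a CSS selector instead of regexes
my $scraper = scraper {
    process '#nv_value', value => 'TEXT';
};

my $res = $scraper->scrape(URI->new($url));
print "raw value: $res->{value}\n";

You can then apply your existing billion/million multiplier logic to $res->{value}.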
Upvotes: 1