Reputation: 1941
I need to display some values that are shown on a website. To do that, I need to scrape the site and fetch the content of a table. Any ideas?
Upvotes: 7
Views: 16186
Reputation: 22570
For similar Stack Overflow questions, have a look at....
I do like using pQuery for things like this; however, Web::Scraper does look interesting.
Upvotes: 2
Reputation: 1047
You could also use the Perl module Web::Scraper. It is simple to understand and has made life easy for me. Follow this example for more information:
http://teusje.wordpress.com/2010/05/02/web-scraping-with-perl/
Upvotes: 2
Reputation: 63
Take a look at the magical Web::Scraper, it's THE tool for web scraping.
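For what it's worth, a minimal sketch of pulling table cells out with Web::Scraper might look like the following. The sample HTML and the key names `rows` and `cells` are made up for illustration; for a live page you would pass a URI object to `scrape` instead of a string.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup standing in for the real page.
my $html = <<'HTML';
<table>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
HTML

# One entry per <tr>; each entry collects the text of its <td> cells.
my $table_scraper = scraper {
    process 'tr', 'rows[]' => scraper {
        process 'td', 'cells[]' => 'TEXT';
    };
};

my $result = $table_scraper->scrape($html);

# Print each row as a comma-separated line.
for my $row (@{ $result->{rows} || [] }) {
    print join(',', @{ $row->{cells} || [] }), "\n";
}
```

The nested `scraper` block is what keeps the cells grouped by row; a single flat `process 'td', ...` would give one long list of cells instead.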
Upvotes: 1
Reputation: 1
I don't mean to drag up a dead thread, but anyone who lands here from Google should also check out WWW::Scripter: 'For scripting web sites that have scripts'.
Happy remote data aggregating ;)
Upvotes: 1
Reputation: 16171
If you're familiar with XPath, you can also use HTML::TreeBuilder::XPath. And if you're not... well you should be ;--)
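As a rough sketch of the XPath approach, assuming the page source is already in a string: the helper name `table_rows` and the sample HTML below are invented for illustration, not taken from the module's documentation.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Return one array reference of cell texts per table row.
sub table_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @rows;
    for my $tr ($tree->findnodes('//table//tr')) {
        # './*' picks up both <th> and <td> children of the row.
        push @rows, [ map { $_->as_trimmed_text } $tr->findnodes('./*') ];
    }
    $tree->delete;    # free the parse tree
    return \@rows;
}

# Hypothetical sample markup standing in for the real page.
my $html = '<table><tr><th>Name</th><th>Value</th></tr>'
         . '<tr><td>alpha</td><td>1</td></tr></table>';
print join(',', @$_), "\n" for @{ table_rows($html) };
```

To target one specific table, the `//table//tr` expression can be narrowed, for example with an `id` or `class` predicate that matches the page you are scraping.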
Upvotes: 2
Reputation: 46225
Although I've generally done this with LWP/LWP::Simple, the current 'preferred' module for any sort of webpage scraping in Perl is WWW::Mechanize.
Upvotes: 3
Reputation: 4872
I use LWP::UserAgent for most of my screen-scraping needs. You can also couple that with HTTP::Cookies if you need cookie support.
Here's a simple example of how to fetch the page source.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new;
my $browser    = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
    # Play with your source here
    my $source = $resp->decoded_content;
    # /s lets "." match newlines, so everything before the first <table> is dropped
    $source =~ s/^.*?<table>/<table>/is;   # this is just an example,
    print $source;                          # not a solution to your problem.
}
Upvotes: 0
Reputation: 7411
I have used HTML::TableExtract in the past. I personally find it a bit clumsy to use, but maybe I did not understand the object model well. I usually use this part of the manual to examine the data:
use HTML::TableExtract;

# $html_string is assumed to already hold the fetched page source
my $te = HTML::TableExtract->new();
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
Upvotes: 4
Reputation: 488664
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:
## print every <h2> tag in page
use pQuery;

pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });
There's also HTML::DOM.
Whatever you do, though, don't use regular expressions for this.
Upvotes: 6