Reputation: 148
I am trying to parse the following page using Perl:
http://www.inc.com/profile/fuhu
I am trying to get information such as Rank, 2013 Revenue, and 2010 Revenue. But when I fetch the page with Perl, I get the following, which is also what the page source shows:
<dl class="RankTable">
<div class="dtddwrapper">
<div class="dtdd">
<dt>Rank</dt><dd><%=rank%></dd>
</div>
</div>
<div class="dtddwrapper">
When I inspect the page with Firebug, I see the following instead:
<dl class="RankTable">
<div class="dtddwrapper">
<div class="dtdd">
<dt>Rank</dt><dd>1</dd>
</div>
</div>
<div class="dtddwrapper">
My Perl code is as follows:
use strict;
use warnings;
use WWW::Mechanize;

my $url  = "http://www.inc.com/profile/fuhu";
my $mech = WWW::Mechanize->new();
$mech->get( $url );
my $data = $mech->content();
print $data;
Upvotes: 2
Views: 182
Reputation: 2732
As others have said, this is not plain HTML; the values are filled in by JavaScript. The data comes from a separate JSON request that the page makes at runtime.
The following script prints the rank and dumps everything else available in $data.
First it gets the ID of the profile and then it makes the appropriate JSON request, just like a regular browser.
use strict;
use warnings;
use WWW::Mechanize;
use JSON qw/decode_json/;
use Data::Dumper;

my $url  = "http://www.inc.com/profile/fuhu";
my $mech = WWW::Mechanize->new();
$mech->get( $url );

# The page content contains the numeric profile ID as "profileID = <digits>".
if ($mech->content() =~ /profileID = (\d+)/) {
    my $id = $1;
    # Request the JSON resource for that ID, as the browser does.
    $mech->get("http://www.inc.com/rest/inc5000company/$id/full_list");
    my $data = decode_json($mech->content());
    my $rank = $data->{data}{rank};
    print "rank is $rank\n";
    print "\ndata hash value \n";
    print Dumper($data);
}
Output:
rank is 1
data hash value
$VAR1 = {
  'time' => '2014-08-22 11:40:00',
  'data' => {
    'ifi_industry' => 'Consumer Products & Services',
    'app_revenues_lastyear' => '195640000',
    'industry_rank' => '1',
    'ifc_company' => 'Fuhu',
    'current_industry_rank' => '1',
    'app_employ_fouryearsago' => '49',
    'ifc_founded' => '2008-00-00',
    'rank' => '1',
    'city_display_name' => 'Los Angeles',
    'metro_rank' => '1',
    'ifc_business_model' => 'The creator of an Android tablet for kids and an Adobe Air application that allows children to access the Internet in a parent-controlled environment.',
    'next_id' => '25747',
    'industry_id' => '4',
    'metro_id' => '2',
    'app_employ_lastyear' => '227',
    'state_rank' => '1',
    'ifc_filelocation' => 'fuhu',
    'ifc_url' => 'http://www.fuhu.com',
    'years' => [
      {
        'ify_rank' => '1',
        'ify_metro_rank' => '1',
        'ify_industry_rank' => '1',
        'ify_year' => '2014',
        'ify_state_rank' => '1'
      },
      {
        'ify_industry_rank' => undef,
        'ify_year' => '2013',
        'ify_rank' => '1',
        'ify_metro_rank' => undef,
        'ify_state_rank' => undef
      }
    ],
    'ifc_twitter_handle' => 'NabiTablet',
    'id' => '22890',
    'app_revenues_fouryearsago' => '123000',
    'ifc_city' => 'El Segundo',
    'ifc_state' => 'CA'
  }
};
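Since the question also asks about revenue, here is a rough sketch of pulling individual fields out of the decoded structure instead of dumping all of it. It runs offline against a trimmed copy of the response above, and it uses JSON::PP (a core module) in place of JSON. The helper name summarize_profile is made up for illustration, and treating app_revenues_lastyear as the 2013 revenue and app_revenues_fouryearsago as the 2010 revenue is my assumption based on the list year being 2014; the response itself does not say so.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core module, used here in place of JSON

# Trimmed copy of the response dumped above; the values are taken from that dump.
my $sample_json = <<'JSON';
{
  "time": "2014-08-22 11:40:00",
  "data": {
    "ifc_company": "Fuhu",
    "rank": "1",
    "app_revenues_lastyear": "195640000",
    "app_revenues_fouryearsago": "123000",
    "years": [
      { "ify_year": "2014", "ify_rank": "1" },
      { "ify_year": "2013", "ify_rank": "1" }
    ]
  }
}
JSON

# Hypothetical helper: pick out the fields the question asks about.
sub summarize_profile {
    my ($decoded) = @_;
    my $d = $decoded->{data} or die "no 'data' key in response\n";
    return {
        company      => $d->{ifc_company},
        rank         => $d->{rank},
        revenue_2013 => $d->{app_revenues_lastyear},        # assumed mapping
        revenue_2010 => $d->{app_revenues_fouryearsago},    # assumed mapping
        # Map each listed year to the overall rank for that year.
        rank_by_year => {
            map { ( $_->{ify_year} => $_->{ify_rank} ) } @{ $d->{years} || [] }
        },
    };
}

my $summary = summarize_profile( decode_json($sample_json) );
printf "%s: rank %s, 2013 revenue %s, 2010 revenue %s\n",
    @{$summary}{qw(company rank revenue_2013 revenue_2010)};
for my $year ( sort keys %{ $summary->{rank_by_year} } ) {
    print "rank in $year: $summary->{rank_by_year}{$year}\n";
}
```

In the real script you would pass the result of decode_json($mech->content()) to the helper instead of the sample string.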
Upvotes: 3
Reputation: 739
The <%=rank%> placeholder is part of a script, not plain HTML. Firebug shows the page after that script has run and filled in the value, whereas the page source shows the raw placeholder. That is why parsing the fetched HTML won't work here.
Usually in cases like this, the values (rank, for example) are sent from the server via an XHR call. So check the XHR requests in Firebug and look at their responses.
Upvotes: 1