Reputation: 3233
I am using the WWW::Mechanize library from Perl to scrape content from a website. However, I noticed that the original HTML source code of the web page and what WWW::Mechanize retrieves differ. As a result, some of the functionality in my script breaks.
So, here is the script (a subset, just enough to demonstrate the issue):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);    # don't keep a history of visited pages
my $url = "http://www.example.com";
$mech->get($url);
print $mech->content;
Short and simple: the script connects to the website and retrieves the entire HTML page. I run it and redirect the output to a text file so that I can analyze it:
perl test.pl > source_code.txt
Now, when I compare source_code.txt with the actual source code of the website as displayed by the browser (Firefox), there are differences.
For instance:
<tr>
<td nowrap="nowrap">This is Some Text</td>
<td align="right"><a href="http://example.com?value=key">Some more Text</a></td>
</tr><tr>
The above is what the browser shows (via its View Page Source feature).
However, the text file source_code.txt (generated by WWW::Mechanize) shows:
<tr>
<td nowrap="nowrap">This is some text</td>
<td align="right">This is some more text</td>
</tr><tr>
As you can see, the anchor tag that was nested inside the second `<td>` element has been removed.
Is this a known issue, or do I need to use something other than $mech->content to view the source code?
Thanks.
Upvotes: 2
Views: 923
Reputation: 2100
This is a common behavior known as "user-agent sniffing": the server returns different markup depending on the User-Agent header the client sends (for example, a simplified page for screen readers or older browsers). You can change the user-agent string in your browser with various plugins, and in WWW::Mechanize as well, as @LHMathies said — see UserAgent.pm and WWW::Mechanize->new.
Example :
my $mech = WWW::Mechanize->new(
    agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
);
See also a list of common user-agent strings.
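If you would rather not maintain a full user-agent string yourself, WWW::Mechanize also ships an agent_alias method that maps a short, friendly alias to a browser-like string. A minimal sketch (the URL is a placeholder, not the asker's real site):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# By default WWW::Mechanize identifies itself as "WWW-Mechanize/#.##",
# which is exactly what user-agent-sniffing servers key on.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows Mozilla');    # built-in alias for a Firefox-on-Windows string

print $mech->agent, "\n";    # confirm the spoofed agent string

# Re-fetch the page with the new agent and compare the markup again:
# $mech->get('http://www.example.com');
# print $mech->content;
```

If the missing anchor tags reappear after switching the agent, the server was sniffing; if not, the difference is more likely JavaScript-generated content, which WWW::Mechanize does not execute.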
Upvotes: 4