Reputation: 3233
I am using the WWW::Mechanize library from Perl to scrape content from a website. However, I noticed that the original HTML source code of the web page and what WWW::Mechanize retrieves differ. As a result, some of the functionality in my script breaks.
So, here is the script (a subset, just enough to demonstrate the issue):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);    # don't keep a history of visited pages
my $url = "http://www.example.com";
$mech->get($url);
print $mech->content;
Short and simple: the script connects to the website and retrieves the entire HTML page. I run it and redirect the output to a text file so that I can analyze it:
perl test.pl > source_code.txt
Now, when I compare source_code.txt with the actual source code of the website as displayed by the browser (Firefox), there are differences.
For instance:
<tr>
<td nowrap="nowrap">This is Some Text</td>
<td align="right"><a href="http://example.com?value=key">Some more Text</a></td>
</tr><tr>
The above is what the browser shows (via its View Page Source feature).
However, the text file source_code.txt (generated by WWW::Mechanize) shows:
<tr>
<td nowrap="nowrap">This is some text</td>
<td align="right">This is some more text</td>
</tr><tr>
As you can see, the anchor tag that was nested inside the second `<td>` element has been removed.
Is this a known issue, or do I need to use something other than $mech->content to view the source code?
Thanks.
Upvotes: 2
Views: 923
Reputation: 2100
This is a common behavior known as "user-agent sniffing": the server returns different markup depending on the User-Agent header the client sends (for example, a simplified page for screen readers or older browsers). You can change the user-agent string in your browser with various plugins, and in WWW::Mechanize as well, as @LHMathies said — see UserAgent.pm and WWW::Mechanize->new.
Example :
my $mech = WWW::Mechanize->new(
    agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
);
See also a list of common user-agent strings.
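If you would rather not maintain a full user-agent string yourself, WWW::Mechanize also ships an agent_alias method that maps a short, friendly alias to a browser-like string. A minimal sketch (the URL is a placeholder, not the asker's real site):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# By default WWW::Mechanize identifies itself as "WWW-Mechanize/#.##",
# which is exactly what user-agent-sniffing servers key on.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows Mozilla');    # built-in alias for a Firefox-on-Windows string

print $mech->agent, "\n";    # confirm the spoofed agent string

# Re-fetch the page with the new agent and compare the markup again:
# $mech->get('http://www.example.com');
# print $mech->content;
```

If the missing anchor tags reappear after switching the agent, the server was sniffing; if not, the difference is more likely JavaScript-generated content, which WWW::Mechanize does not execute.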
Upvotes: 4