How to bypass HTML escape signs and extract text only from an HTML file in Perl using Web::Scraper

Question

I am trying to extract only the text from an HTML page and want to ignore or bypass the HTML escape signs "<" and ">". Here is the part of the HTML page that I used for the text extraction:

        
...
My Perl code is:
use strict;
use warnings;
use URI;
use Web::Scraper;

my $urlToScrape = "http://www.w3schools.com/tags/";

# prepare data: collect the text and the href of every link in the reference table
my $teamsdata = scraper {
    process "table.reference > tr > td > a", 'tags[]' => 'TEXT';
    process "table.reference > tr > td > a", 'urls[]' => '@href';
};

# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));

# FILE is presumably a filehandle opened elsewhere (not shown here)
print "\n";
for my $i (0 .. $#{ $res->{urls} }) {
    print FILE "    $res->{tags}[$i] \n";
}
print "\n";
The output I get is the following:

      
          
          
          

whereas I want output as:

     !--...-- 
         !DOCTYPE 
         a 
         abbr 

Can anyone tell me what I have to change in order to get the above output?
Many thanks.

    
    Tag           Description
    <!--...-->    Defines a comment
    <!DOCTYPE>    Defines the document type
    <a>           Defines a hyperlink
    <abbr>        Defines an abbreviation

Krishnachandra Sharma · Accepted Answer

Brute force: strip the angle brackets from each tag before printing it.

$res->{tags}[$i] =~ s/[<>]//g; ## Added line
print FILE "    $res->{tags}[$i] \n";
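
The same idea can be tried without hitting the network. This is only a minimal sketch: `strip_angle_brackets` is a hypothetical helper name (it is not part of Web::Scraper), and the `@tags` list is made-up sample data standing in for what the `'TEXT'` extraction might return for the table links.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Web::Scraper function): return a copy of
# the string with every literal '<' and '>' character deleted.
sub strip_angle_brackets {
    my ($text) = @_;
    (my $clean = $text) =~ tr/<>//d;    # tr///d deletes the listed characters
    return $clean;
}

# Assumed sample values, standing in for $res->{tags}
my @tags  = ('<!--...-->', '<!DOCTYPE>', '<a>', '<abbr>');
my @clean = map { strip_angle_brackets($_) } @tags;

print "    $_\n" for @clean;
```

Working on a copy keeps the scraped values in `$res->{tags}` untouched, which may matter if the original tag text is needed later. `tr/<>//d` behaves the same as the `s/[<>]//g` substitution for this purpose.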
