Jeya Suriya Muthumari

Reputation: 2021

Perl script to extract a particular div section from HTML code

I have a very large HTML file, and I need to extract a particular <div>...</div> section from it into a variable.

##some contents
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>
##some contents

I tried a regex, like this:

my $html_filepath = "G:\\Report.html";
# Three-argument open with a lexical filehandle
open(my $html_fh, '<', $html_filepath) or die "Can't open $html_filepath: $!\n";
$body .= "\nTest Report Summary:\n\n";
my $content;
my $summarySection;
{
    local $/ = undef; # slurp mode
    $content = <$html_fh>;
}
$content =~ s/\r\n//g;
#print $content;

if ($content ne "")
{
    if ($content =~ m/<div class="title-bar" (.*)/)
    #if ( $last_line =~ m/^<tr> <td>(\d+)<\/td>/ )
    {
        $summarySection = "$1";
    }
}
print "\n $summarySection";

The output I got is:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016

But I need the output to look like this:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>

I also tried the following regex:

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/)

But this did not work.

Please give me some ideas for capturing the whole section, including its line breaks and whitespace.

Upvotes: 0

Views: 836

Answers (1)

bolav

Reputation: 6998

Please don't use a regex to parse HTML. Use a Perl module that is built for parsing HTML.

Something like HTML::TreeBuilder:

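use strict;
use warnings;
use HTML::TreeBuilder 5 -weak; # Ensure weak references

my $html_filepath = "G:\\Report.html"; # path taken from the question

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($html_filepath);
# In scalar context, look_down returns the first matching element (or undef)
my $elem = $tree->look_down('_tag' => 'div', 'class' => 'title-bar')
    or die "No <div class=\"title-bar\"> found in $html_filepath\n";
warn $elem->as_HTML;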
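The desired output in the question also covers the <div id="c_1"> that follows the title bar, so a single look_down on class title-bar may not return everything. A rough sketch of one way to collect both parts, and to read the table cells as label/value pairs, is below. The inline HTML is a hypothetical, cut-down stand-in for the real report, and the names %field and $summary_section are made up for illustration; how the parser treats the real file's unbalanced tags is not checked here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical, cut-down stand-in for the report in the question.
my $html = <<'END_HTML';
<html><body>
<div class="title-bar"><table class="layout"><tr><td class="h1">Summary Of Test Report</td></tr></table></div>
<div expandable="1" id="c_1"><table id="details">
<tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr>
<tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr>
</table></div>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Look up both divs; grep drops any that were not found.
my @parts = grep { defined } (
    scalar $tree->look_down(_tag => 'div', class => 'title-bar'),
    scalar $tree->look_down(_tag => 'div', id    => 'c_1'),
);

# An empty hashref as the third argument asks as_HTML to emit every end tag.
my $summary_section = join '', map { $_->as_HTML(undef, undef, {}) } @parts;

# Read each row of the details table as a label/value pair.
my %field;
my $details = $tree->look_down(_tag => 'table', id => 'details');
for my $row ($details ? $details->look_down(_tag => 'tr') : ()) {
    my ($label, $value) = map { $_->as_trimmed_text } $row->look_down(_tag => 'td');
    next unless defined $value;
    $label =~ s/:\s*\z//;    # drop the trailing colon from the label
    $field{$label} = $value;
}

print "$summary_section\n";
print "$_ = $field{$_}\n" for sort keys %field;
```

Working from the parsed tree means the line breaks inside a cell (as in the Date row) no longer matter, which is the main reason to prefer this over a pattern match.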

The problem with your regex is that, by default, . does not match a newline. For ways to match every character, see: Regex to match any character including new lines

The way to fix this is to use the s modifier (treat the string as a single line):

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/s)
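One caveat worth checking: with /s, a greedy .* runs to the last </table></div></div> in the file, not the first. If the report has more than one section ending that way, a non-greedy .*? is probably what is wanted. The sketch below compares the three variants on a hypothetical three-line sample (the sample string and variable names are illustrative, not from the real report):

```perl
use strict;
use warnings;

# Hypothetical sample: two sections that both end in </table></div></div>.
my $content = join "\n",
    '<div class="title-bar" onclick="folder(c_1)">Summary</div><div id="c_1"><div><table><tr><td>Date: 8-02-2016',
    '</td></tr></table></div></div>',
    '<div id="c_2"><div><table><tr><td>unrelated</td></tr></table></div></div>';

my $end = qr{</table></div></div>};

# Without /s, "." stops at the newline, so the end marker is never reached.
my $no_s = $content =~ m{<div class="title-bar" .*$end} ? 1 : 0;

# With /s, greedy .* extends to the LAST end marker and includes the second section.
my ($greedy) = $content =~ m{(<div class="title-bar" .*$end)}s;

# With /s and non-greedy .*?, the match stops at the FIRST end marker.
my ($lazy) = $content =~ m{(<div class="title-bar" .*?$end)}s;

print "matched without /s: $no_s\n";
print "greedy length: ", length($greedy), "\n";
print "lazy length:   ", length($lazy), "\n";
```

Putting the capture group around the whole pattern, as here, also keeps the opening <div class="title-bar" text in the captured string.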

Upvotes: 4
