UxBoD

Reputation: 35

Extract everything within HTML tag

I am having real problems trying to extract everything within an HTML tag. I have the following Perl script which I am using to test:

#!/usr/bin/perl

my $text = '<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-    micr=osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" xmlns=3D"http:=//www.w3.org  /TR/REC-html40"><head><META HTTP-EQUIV=3D"Content-Type" CONTENT==3D"text/html; charset=3Dus-ascii"><meta name=3DGenerator content=3D"Micros=oft Word 14 (filtered medium)">This is a test</HTML>';

my $html = "Add this first";
$text =~ /(<html .*>)(.*)/i;
print $text . "\n";

What I need is for the opening <html ...> tag to be captured in $1 and whatever is left in $2. Then I can add in my text using print $1 . $html . $2

I just cannot get it to work :(

Upvotes: 0

Views: 359

Answers (2)

Nick Brunt

Reputation: 10057

Rather than using .*, which is greedy and will match past the closing > of the tag all the way to the last > in the string, try [^>]*, which matches anything except a closing >.
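As a rough sketch of that suggestion applied to the script in the question: the $text below is a shortened stand-in for the original input (an assumption, not the asker's exact string), and $out is a name introduced here for illustration. It also assumes no attribute value contains a literal >, since [^>]* would stop there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened stand-in for the input in the question (assumed, for illustration only).
my $text = '<html xmlns:v="urn:schemas-microsoft-com:vml"><head><meta name="Generator" content="Microsoft Word 14"></head>This is a test</HTML>';
my $html = "Add this first";

# $1: the opening <html ...> tag only, because [^>]* cannot run past the first '>'.
# $2: everything after that tag. /i ignores tag case, /s lets . match newlines.
my $out = $text;    # fall back to the unchanged input if nothing matches
if ($text =~ /(<html\b[^>]*>)(.*)/is) {
    $out = $1 . $html . $2;
}
print "$out\n";
```

Note that the original script prints $text, which the match never modifies; the captures have to be used explicitly, as in the concatenation above.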

However, in general regex is not the right way to parse HTML. It just doesn't work. There are so many variations in the way that HTML is written that you'll come up against a ridiculous number of problems.

The real solution is to parse the document into a DOM tree and find what you want that way. Try using an HTML or XML parser.

Upvotes: 4

FailedDev

Reputation: 26930

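# /i matches <html ...> and </HTML> alike; /s lets . span newlines.
if ($subject =~ m!<html[^>]*>(.*?)</html>!is) {
    $result = $1;    # everything between the opening and closing tags
}

Things to note: your input starts with <html and ends with </HTML>. A case-sensitive pattern cannot match both, hence the /i flag.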

Also, if this is the ONLY tag you are considering extracting, then you can use a regex. However, if you want to extract specific tags from inside the html/xhtml/xml etc., you should consider using one of the countless modules that are written for this job.

Upvotes: 0
