LaboDJ

Reputation: 207

Perl regex on HTML markup

I just want to delete the block between

<!DOCTYPE html>

and

 <body>

including both of those tags, using a Perl regex.

Example text:

<!DOCTYPE html>


<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style>code{white-space: pre;}</style>



<![endif]-->;

<body>
.
.
.
anything here

This is only a sample; my real file contains a long embedded JavaScript block.

I usually test my regexes on the regex101 website, and I came up with this one:

<\!DOCTYPE html>(\n.*)*<body>

and this one, which also allows spaces or tabs inside the body tag:

s/<\!DOCTYPE html>(\n.*)*<[ \t]*body[ \t]*>//gi;

It seems to work well on that website, but it doesn't work when I run it inside a Perl script.

PERL SCRIPT (with @Jan's answer):

#!/usr/bin/perl
use strict;
use warnings;

my $dirtfile = $ARGV[0];
my $cleanfile = "clean.html";

open(IN, "<", $dirtfile) or die "Can't open $dirtfile: $!";
open(OUT, ">", $cleanfile) or die "Can't open $cleanfile: $!";

while (<IN>) {
  s/(?s)<!DOCTYPE html>.+?<body>(?-s)//gi;
  print(OUT);
}

OUTPUT:

the same as input

Upvotes: 0

Views: 159

Answers (3)

mut3

Reputation: 54

I'm pretty sure the problem is that you're reading the file line by line, so the regex only ever sees one line at a time and can never match a block that spans several lines. I think you'll either need to read the entire file into a string and apply the regex to that, or change your loop logic to skip every line from the opening tag through the closing one.
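As a minimal sketch of the first option (reading the whole file into a string), reusing the clean.html output name from the question. strip_head is a hypothetical helper name, and the pattern assumes a bare <body> tag with no attributes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the question): removes everything from
# <!DOCTYPE html> through the first <body> tag, inclusive.
sub strip_head {
    my ($html) = @_;
    # /s lets "." match newlines, /i ignores case, and the lazy .*? stops
    # at the first <body>. Assumes <body> carries no attributes.
    $html =~ s/<!DOCTYPE\s+html>.*?<\s*body\s*>//si;
    return $html;
}

# Only touch files when an input file name is given on the command line.
if (@ARGV) {
    my $dirtfile  = $ARGV[0];
    my $cleanfile = "clean.html";

    open(my $in, "<", $dirtfile) or die "Can't open $dirtfile: $!";
    my $html = do { local $/; <$in> };   # undefined $/ reads the whole file at once
    close($in);

    open(my $out, ">", $cleanfile) or die "Can't open $cleanfile: $!";
    print {$out} strip_head($html);
    close($out) or die "Can't close $cleanfile: $!";
}
```

Because the substitution now sees the whole file as one string, the match can span as many lines as it needs to.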

In general, you should avoid working on HTML with regexes. Use a DOM extension instead.

Upvotes: 2

Sinan Ünür

Reputation: 118138

Since you are not really parsing HTML, but instead chopping a leading part of the file, you may get away with using regular expressions. This may get much more complicated if you have the target strings in any comments etc, but, if that is not the case, simply using the flip-flop operator .. should do it:

$ perl -ne 'print unless /<!DOCTYPE html>/i .. /<body>/i' file.html
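Inside a script rather than a one-liner, the same flip-flop idea might look like the sketch below. copy_without_head is a hypothetical name, and the example assumes the opening and closing tags each sit on their own line, since whole lines are dropped:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copies lines from $in to $out, skipping every line
# from the one containing <!DOCTYPE html> through the one containing <body>.
sub copy_without_head {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        # In scalar context ".." is the flip-flop operator: it becomes true
        # when the left test matches and stays true until the right one does.
        next if $line =~ /<!DOCTYPE html>/i .. $line =~ /<body>/i;
        print {$out} $line;
    }
    return;
}

# Only read a file when a name is given on the command line.
if (@ARGV) {
    open(my $in, "<", $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    copy_without_head($in, \*STDOUT);
    close($in);
}
```

This keeps the line-by-line loop from the question, so it does not need to hold the whole file in memory. Anything that shares a line with either tag is removed along with it.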

Upvotes: 1

Jan

Reputation: 43169

It is usually considered bad practice to use regular expressions on HTML; nevertheless, you could come up with:

(?s)<!DOCTYPE html>.+?<body>(?-s)
# (?s) switches on single-line mode (aka dot matches all)
# matches <!DOCTYPE html>
# then everything afterwards, lazily (.+?)
# up to and including the body tag
# (?-s) switches single-line mode off again

See a demo on regex101.com. It won't work as expected when a <body> tag appears somewhere in between (including inside comments).
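To illustrate that caveat, here is a small sketch with a made-up input string in which <body> appears inside a comment. The /s flag stands in for the inline (?s), and /i is the case-insensitive flag from the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, with single-line mode given as the /s flag.
my $pattern = qr/<!DOCTYPE html>.+?<body>/si;

# Made-up sample: a <body> inside a comment precedes the real tag.
my $html = "<!DOCTYPE html>\n<!-- <body> in a comment -->\n<title></title>\n<body>\ncontent\n";

(my $cleaned = $html) =~ s/$pattern//;

# The lazy match stops at the <body> inside the comment, so the rest of
# the comment, the title, and the real <body> tag are all left behind.
print $cleaned;
```

The leftover text begins with " in a comment -->", which shows that the match ended too early.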

Upvotes: 0
