Alex Baranosky

Reputation: 50074

Matching a regex in html, ignoring spaces, and quotation marks

I need to find a certain chunk in a group of HTML files and delete it from all of them. The files are really hacked-up HTML, so instead of parsing them with the Html Agility Pack, as I was trying before, I would like to use a simple regex.

The section of HTML will always look like this:

<CENTER>some constant text <img src=image.jpg> more constant text: 
 variable section of text</CENTER>

All of the above can be any combination of upper and lower case. Notice that the image source can appear unquoted, as img src=image.jpg rather than img src="image.jpg". There can also be any number of whitespace characters between the constant characters.

Here are some examples:

    <CENTER>This page has been visited
    <IMG SRC=http://place.com/image.gif ALT="alt text">times since 10th July 2007
    </CENTER>

or

    <center>This page has been visited
    <IMG src="http://place.com/image.gif" Alt="Alt Text">
    times since 1st October 2005</center>

What do you think would be a good way to match this pattern?

Upvotes: 1


Answers (3)

Renaud Bompuis

Reputation: 16786

In C# you could simply use this, assuming that originalHtml contains your whole HTML file:

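// Requires: using System.Text.RegularExpressions;
// [^>]* accepts both quoted and unquoted src values; \s+ allows any whitespace after "img".
string result = Regex.Replace(originalHtml,
                       @"\s*<center>[^<]*<img\s+src=[^>]*>.*?</center>\s*",
                       "",
                       RegexOptions.Singleline | RegexOptions.IgnoreCase);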

The Regex will remove any occurrence of the pattern in the original HTML and return the modified version.

Upvotes: 0

qpingu

Reputation: 960

It really depends on how simple you can make the regex and match the desired elements.

<center>[^<]+<img[^>]+>[^>]+</center>

Use the case-insensitive flag too (I don't know what C# uses). If that pattern is too loose, because some center blocks containing an img tag must not be matched, then you can start hardcoding phrases like the other answer does.
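
For reference, the case-insensitive flag in .NET is RegexOptions.IgnoreCase. Below is a minimal sketch of applying the pattern above in C#; the class name CenterImgDemo and the helper StripBlocks are hypothetical names chosen for illustration, and the sample input is the first example from the question wrapped in two made-up paragraphs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CenterImgDemo
{
    // Pattern from the answer above, unchanged.
    private const string Pattern = @"<center>[^<]+<img[^>]+>[^>]+</center>";

    // Hypothetical helper: removes every block the pattern matches.
    public static string StripBlocks(string html) =>
        Regex.Replace(html, Pattern, "", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // First example from the question, surrounded by assumed markup.
        string html =
            "<p>before</p><CENTER>This page has been visited \n" +
            "<IMG SRC=http://place.com/image.gif ALT=\"alt text\">times since 10th July 2007\n" +
            "</CENTER><p>after</p>";

        Console.WriteLine(StripBlocks(html)); // prints "<p>before</p><p>after</p>"
    }
}
```

RegexOptions.Singleline is not needed here, because negated character classes such as [^<] already match line breaks.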

Upvotes: 1

Alan Moore

Reputation: 75242

How much of that text is needed to uniquely identify the target? I would try this first:

@"(?is)<center>\s*This\s+page\s+has\s+been\s+visited.*?</center>"
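
As a rough check, the pattern can be run against both examples from the question. This is only a sketch: the class name VisitCounterDemo, the helper RemoveCounter, and the surrounding body tags are assumptions made for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisitCounterDemo
{
    // (?is) switches on IgnoreCase and Singleline inline,
    // so no RegexOptions argument is required.
    private static readonly Regex Counter = new Regex(
        @"(?is)<center>\s*This\s+page\s+has\s+been\s+visited.*?</center>");

    // Hypothetical helper: deletes every matching counter block.
    public static string RemoveCounter(string html) => Counter.Replace(html, "");

    public static void Main()
    {
        // The two samples from the question: unquoted and quoted src.
        string unquoted =
            "<CENTER>This page has been visited \n" +
            "<IMG SRC=http://place.com/image.gif ALT=\"alt text\">times since 10th July 2007\n" +
            "</CENTER>";
        string quoted =
            "<center>This page has been visited \n" +
            "<IMG src=\"http://place.com/image.gif\" Alt=\"Alt Text\"> \n" +
            "times since 1st October 2005</center>";

        Console.WriteLine(RemoveCounter("<body>" + unquoted + "</body>")); // prints "<body></body>"
        Console.WriteLine(RemoveCounter("<body>" + quoted + "</body>"));   // prints "<body></body>"
    }
}
```

Because the pattern never looks at the img tag, quoting of the src attribute makes no difference; the lazy .*? stops at the first closing center tag after the fixed phrase.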

Upvotes: 2
