Alex Dn

Reputation: 5553

Regex replace whitespace in HTML document

I have seen many similar questions but still haven't found an answer.
What should a regex look like that removes all whitespace (including newlines) from an HTML document but ignores the tags?

Currently I use Regex.Replace(content, @"\s+", ""); but it also removes the spaces in the JavaScript on the page, and then the page stops working.

Thank you.

EDIT: After some questions in the responses, here are a few more details. What I'm building is an HTTP module that "minifies" the HTML output of our site. We have a website with very dynamic content that comes from many different sources. The final goal is to reduce page size and network traffic. It's a heavily loaded website, so this is important to us.

We already use the MbCompression library for JS and CSS minification, but it does not support minifying HTML output (at least I haven't found a way).

Upvotes: 0

Views: 3233

Answers (6)

sainiuc

Reputation: 1697

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

There are risks related to tags, unclosed tags etc. I hope you have some control over the 'dynamic content that comes from different sources' as you've put it. I also hope that you've tried everything else and this comes as a last resort.
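To make those risks concrete, here is a minimal sketch of what the pattern above does. The class and method names (TagTrimDemo, TrimAroundTags) are made up for illustration; only Regex.Replace comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagTrimDemo
{
    // Removes whitespace directly before and after anything shaped like <...>.
    // Whitespace inside text runs is left untouched.
    public static string TrimAroundTags(string html)
    {
        return Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1");
    }
}
```

For "<div>\n  <p> Hello   world </p>\n</div>" this returns "<div><p>Hello   world</p></div>". Two caveats follow from the pattern itself: "foo <b>bar</b> baz" becomes "foo<b>bar</b>baz", so words around inline tags get glued together, and inside a script a comparison such as a < b followed later by a > looks like a tag to the regex.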

Upvotes: 0

Mike Samuel

Reputation: 120516

If you can find a decent HTML parser, I would do it via DOM manipulation. If you can't, then something like

Regex.Replace(content, "(?i)(<script(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</script\\s*>|<style(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</style\\s*>|<textarea(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</textarea\\s*>|</?[a-z](?:[^>\"']|\"[^\"]*\"|'[^']*')*>|[^\\s<]+)|\\s+", "$1");

should do it. It will not remove whitespace inside tags, inside embedded JS or CSS, or inside textareas, but it will remove whitespace (including newlines) in text nodes.
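The same idea can be written in a more readable form. This is a sketch under assumptions, not a tested minifier: the names HtmlWhitespaceSketch and StripWhitespace are hypothetical, and protecting pre blocks as well is an addition that the answer does not mention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitespaceSketch
{
    // Tag attributes: unquoted characters, or a double- or single-quoted value.
    private const string Attrs = @"(?:[^>""']|""[^""]*""|'[^']*')*";

    // Group "keep" captures everything that must survive unchanged:
    // whole script/style/textarea/pre elements, single tags, and text.
    // The last alternative matches whitespace runs, which are dropped.
    private static readonly Regex Tokens = new Regex(
        @"(?<keep><(?<name>script|style|textarea|pre)\b" + Attrs + @">[\s\S]*?</\k<name>\s*>" +
        @"|</?[a-z]" + Attrs + @">" +
        @"|[^\s<]+)" +
        @"|\s+",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string StripWhitespace(string html)
    {
        // When only the whitespace alternative matched, "keep" is empty,
        // so the match is replaced with nothing.
        return Tokens.Replace(html, m => m.Groups["keep"].Value);
    }
}
```

With the input "<p> Hello   world </p>\n<script>var a = 1;  var b = 2;</script>" the script element comes back untouched while the paragraph text becomes "Helloworld". That is what the question asked for, but it also shows why removing whitespace outright is risky for visible text.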

Upvotes: 1

ZZ-bb

Reputation: 2167

What's your goal? Browsers ignore a lot of whitespace when rendering pages so I'm guessing you want to clean up your source code. If so, check if the program you use offers some solution to this. For example Dreamweaver has a tool to reformat source code.

Tidy could be one option but it looks like it's a bit more than a simple code formatting tool.

Upvotes: 1

mmuratusta

Reputation: 100

Regex.Replace(document.body.innerHTML, @"\s+", "");

using document.body.innerHTML instead may work. I am not sure.

Upvotes: 0

perh

Reputation: 1708

There is really no way to write a single (reasonable) regex to do this, especially not if you want to support JavaScript and CSS. You need a real parser.

Upvotes: 2

Chris

Reputation: 27609

Surely you should be replacing it with a single space at least, not removing whitespace entirely. For HTML that should be fine, but if you have JavaScript strings containing multiple spaces that must not be collapsed, you need another method, since a regular expression can't easily work out whether it is inside a script, inside a string, and so on.
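A small sketch of the difference between the two replacements (the class and method names are invented for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CollapseDemo
{
    // What the question currently does: every whitespace run disappears.
    public static string RemoveAll(string text)
    {
        return Regex.Replace(text, @"\s+", "");
    }

    // The suggestion here: every whitespace run becomes one space.
    public static string Collapse(string text)
    {
        return Regex.Replace(text, @"\s+", " ");
    }
}
```

For "Hello,\n   world" the first method gives "Hello,world" and the second gives "Hello, world". Neither one knows whether it is looking at markup, script, or a string literal.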

That said, I'm not sure there is a good reason to do this. If you are worried about the size of the file, just tell your server to use compression, which I suspect every browser supports well enough by now; the pages are basically zipped by the server and unzipped on the client. It's a bit more work for the server, so it depends on whether you care more about bandwidth or CPU.
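One way to check how much whitespace still matters once compression is on is to gzip a page before and after stripping and compare the byte counts. The helper below is only a measuring sketch with a hypothetical name (GzipSizeDemo.GzipLength); it uses GZipStream from System.IO.Compression and does not configure any server.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSizeDemo
{
    // Returns the number of bytes the text occupies after gzip compression.
    public static int GzipLength(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen keeps the MemoryStream usable after the gzip stream
            // is disposed, which is what flushes the final compressed bytes.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return (int)buffer.Length;
        }
    }
}
```

Calling GzipLength on the original page and on the stripped page shows how many bytes the stripping saves on the wire. Repetitive indentation compresses very well, so the saving after gzip is typically much smaller than the saving in raw size.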

Upvotes: 0
