Reputation: 1844

Using regular expression to trim html

Been trying to solve this for a while now.

I need a regex to strip the newlines, tabs and spaces between the html tags demonstrated in the example below:

Source:

<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>

Wanted result:

<html><head><title>Some title</title></head></html>

Trimming the whitespace before "Some title" is optional. I'd be grateful for any help.

Upvotes: 1

Answers (9)

Shash

Reputation: 1

I wanted to preserve the newlines, since removing them was messing up my HTML, so I went with the following:

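// Requires: using System.Text.RegularExpressions;
private static string ProcessHTMLFile(string input)
{
    // Remove spaces in pairs (indentation). A run of odd length leaves one space behind,
    // so single spaces between words are kept.
    string opt = Regex.Replace(input, @"(  )+", "");
    // Remove tabs. Newlines are left untouched.
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}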

Upvotes: -1

Philipp

Reputation: 4729

A solution with XSLT would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">    
<xsl:output  method="xml" encoding="UTF-8" indent="no"/>

<xsl:template match="*|@*">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- trim whitespaces from the content -->
<xsl:template match="text()">
    <!-- remove from tag to content -->
    <xsl:variable name="trimmedHead" select="replace(.,'^\s+','')"/>
    <xsl:variable name="trimmed" select="replace($trimmedHead,'\s+$','')"/>
    <xsl:value-of select="$trimmed"/>
</xsl:template>

<!-- alternative: drop only text nodes that consist entirely of whitespace -->
<xsl:template match="text()">
    <xsl:if test="not(matches(.,'^\s+$'))">
        <xsl:value-of select="."/>
    </xsl:if>
</xsl:template>

</xsl:stylesheet>

Keep only one of the two text() templates. The first trims leading and trailing whitespace from every text node, even when it contains content; the second drops only text nodes that consist entirely of whitespace or newlines.

Upvotes: 0

user105033

Reputation: 19568

Try this:

s/[^\w\/\d<>]+//gs

Upvotes: 0

dankyy1

Reputation: 1194

Regex.Replace(input, "<[^>]*>", String.Empty);

Upvotes: 0

Chas. Owens

Reputation: 64909

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that turns all strings into sequences of 1-byte characters and is normally not what you want).
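The point above is about Perl, but the same pitfall can be observed in other engines. As a rough illustration only (Python 3, not Perl), the `re` module also treats `\d` as "any Unicode decimal digit" for `str` patterns unless the `re.ASCII` flag is given. The sample list and variable names below are just for this sketch.

```python
import re

# "5", MONGOLIAN DIGIT FIVE (U+1815) and FULLWIDTH DIGIT FIVE (U+FF15)
samples = ["5", "\u1815", "\uFF15"]

# \d on a str pattern matches every Unicode decimal digit
unicode_hits = [s for s in samples if re.fullmatch(r"\d", s)]

# An explicit character class matches ASCII digits only
class_hits = [s for s in samples if re.fullmatch(r"[0-9]", s)]

# re.ASCII restricts \d to [0-9] as well
ascii_flag_hits = [s for s in samples if re.fullmatch(r"\d", s, flags=re.ASCII)]

print(len(unicode_hits), len(class_hits), len(ascii_flag_hits))
```

If only ASCII digits are intended, spelling out `[0-9]` is the least surprising choice in either language.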

Regexes are fundamentally bad at parsing HTML (for why, see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the HTMLAgilityPack answer helpful.

Upvotes: 1

Bran Handley

Reputation: 153

This removes the whitespace between tags and the space between the tags and the text.

s/(\s*(<))|((>)\s*)/\2\4/g
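For readers who want to try the idea outside Perl or sed, here is a minimal Python sketch of the same substitution. The function name `trim_around_tags` is made up for this example, and like any regex approach it will also touch `<` and `>` that appear inside scripts, comments or `<pre>` blocks.

```python
import re

def trim_around_tags(html: str) -> str:
    # Match either "whitespace followed by <" or "> followed by whitespace"
    # and keep only the angle bracket itself.
    return re.sub(r"\s*<|>\s*", lambda m: m.group().strip(), html)

source = (
    "<html>\n"
    "   <head>\n"
    "     <title>\n"
    "           Some title\n"
    "       </title>\n"
    "    </head>\n"
    "</html>"
)

print(trim_around_tags(source))
# <html><head><title>Some title</title></head></html>
```

Whitespace between words inside the text (the space in "Some title") is left alone because it is not adjacent to a bracket.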

Upvotes: 0

ʞɔıu

Reputation: 48386

s/\s*(<[^>]+>)\s*/\1/gs

or, in C#:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

Upvotes: 0

Welbog

Reputation: 60398

If the HTML is strict, load it with an XML reader and write it back without formatting. That will preserve the whitespace within tags, but not between them.
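A rough sketch of that approach in Python, assuming the markup really is well-formed XML. The helper name `strip_between_tags` is hypothetical, and `xml.etree.ElementTree` is used only as one possible XML reader. ElementTree keeps whitespace-only text nodes when parsing, so the sketch clears them explicitly before serialising.

```python
import xml.etree.ElementTree as ET

def strip_between_tags(markup: str) -> str:
    root = ET.fromstring(markup)  # raises ParseError if the input is not well-formed
    for el in root.iter():
        # el.text is the text right after the start tag; el.tail follows the end tag.
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    # Serialise without pretty-printing.
    return ET.tostring(root, encoding="unicode")

source = (
    "<html>\n"
    "   <head>\n"
    "     <title>\n"
    "           Some title\n"
    "       </title>\n"
    "    </head>\n"
    "</html>"
)

result = strip_between_tags(source)
print(result)
```

Text that contains real content, such as the padded "Some title", is left exactly as written; only the whitespace-only gaps between tags disappear. Note that elements left without any content are serialised by ElementTree in self-closing form.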

Upvotes: 20

JSBձոգչ

Reputation: 41378

s/>\s+</></gs

Upvotes: 0
