Reputation: 20667

Compress whitespace between attributes in an HTML tag

We just released some code to make our software a little bit more user friendly, and it backfired. Basically, we're attempting to replace newlines with <br /> tags. The trouble is, sometimes our users will enter code like the following:

<a
 href='http://nowhere.com'>Nowhere</a>

When we run our code, this translates to

<a<br /> href='http://nowhere.com'>Nowhere</a>

which obviously doesn't render properly.

Is there a regular expression or a PHP function to strip, or perhaps compress, the whitespace between the attributes of an HTML tag?

Clarification: This isn't full HTML. It's more similar to Markdown or some other language (we will eventually be moving to Markdown, but I need a quick fix). So I can't just parse this as regular HTML. The newlines need to be converted to <br /> tags properly.

Upvotes: 2

Answers (4)

Topher Fangio

Reputation: 20667

After some searching and much trial and error, I have come up with the following solution/hack:

/*
 * Compress all whitespace within HTML tags (including PRE at the moment)
 */
$regexp = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

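// Collect every substring of $text that looks like an HTML tag
preg_match_all($regexp, $text, $matches);

foreach($matches[0] as $match) {
  // Collapse each run of whitespace (newlines included) to a single space
  $new_html = preg_replace('/\s+/', ' ', $match);
  // Swap the original tag for its compressed version
  $text = str_replace($match, $new_html, $text);
}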

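After executing this code, every tag in $text that the pattern matches has its internal runs of whitespace (newlines included) collapsed to a single space, so a later newline-to-<br /> conversion no longer inserts a <br /> inside those tags. Note that whitespace inside quoted attribute values is collapsed as well.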

I know that this isn't the best solution, but it works, and pretty soon we'll be migrating to a true markup language (such as Markdown).
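For illustration, here is a minimal sketch of the same idea applied to the example from the question. The function name compress_tag_whitespace is hypothetical. The newline conversion is assumed to be a plain str_replace, since the question does not say how it is done. The sketch uses preg_replace_callback in place of the match-then-str_replace loop, which should give the same result in one pass.

```php
<?php
// Hypothetical helper: collapse whitespace inside anything the tag regex matches.
function compress_tag_whitespace(string $text): string
{
    // Same pattern as in the answer above.
    $regexp = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

    $result = preg_replace_callback(
        $regexp,
        function (array $m): string {
            // $m[0] is the whole matched tag; squeeze each whitespace run to one space.
            return preg_replace('/\s+/', ' ', $m[0]);
        },
        $text
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

$input  = "Visit\n<a\n href='http://nowhere.com'>Nowhere</a>\ntoday";
$clean  = compress_tag_whitespace($input);

// Assumption: newlines are converted with a simple replacement after the cleanup.
$output = str_replace("\n", "<br />", $clean);
echo $output, "\n";
```

With this input the script should print Visit<br /><a href='http://nowhere.com'>Nowhere</a><br />today. One caveat: "." does not match a newline by default in PCRE, so a tag whose quoted attribute value itself spans several lines will not match the pattern and is left untouched.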

Upvotes: 1

ChrisJ

Reputation: 5251

Ideally, you would use an XML parser, through DOM or SAX APIs. However, if your content is not proper XML, but plain text with a few tags, the parser may fail (it depends on the tool used, I guess).

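A rough solution for your particular problem may be as follows: construct a state machine with two states, "inside" a tag and "outside" a tag, starting in the "outside" state. Read the input character by character. Upon reading '<', switch to the "inside" state. Upon reading '>', switch to the "outside" state. Upon reading '\n', emit "<br />" if in the "outside" state; if in the "inside" state, emit a single space instead, since dropping the newline entirely could fuse the tag name with the next attribute. Every other character is copied to the output unchanged.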

This is just a sketch, and it may need to be refined.
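The two-state scan described above could look roughly like this in PHP. This is only a sketch under stated assumptions: the function name newlines_to_br_outside_tags is made up for the example, and a newline inside a tag is emitted as a single space (dropping it outright would turn "<a\nhref" into "<ahref").

```php
<?php
// Hypothetical helper implementing the two-state scan described above.
function newlines_to_br_outside_tags(string $text): string
{
    $inside = false;          // true while we are between '<' and '>'
    $out    = '';
    $len    = strlen($text);  // byte-wise scan; '<', '>' and "\n" are single-byte in UTF-8

    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];

        if ($ch === '<') {
            $inside = true;
            $out .= $ch;
        } elseif ($ch === '>') {
            $inside = false;
            $out .= $ch;
        } elseif ($ch === "\n") {
            // Assumption: a space inside a tag keeps attributes separated.
            $out .= $inside ? ' ' : '<br />';
        } else {
            $out .= $ch;      // everything else passes through unchanged
        }
    }

    return $out;
}

echo newlines_to_br_outside_tags("Visit\n<a\n href='http://nowhere.com'>Nowhere</a>\ntoday"), "\n";
```

The obvious weakness is that any literal '<' in ordinary text (for example "a < b") flips the machine into the "inside" state until the next '>', so newlines in between would become spaces rather than <br /> tags. A '>' inside a quoted attribute value would likewise end the tag early.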

Upvotes: 0