varun

Reputation: 2064

How to strip out whitespace from an HTML document

I am trying to find a valid regular expression that I can use to strip out all the whitespace and newline characters around the tags.

Below is something I tried.

((\s|\n|\r)?<(\s|\n|\r)?)|(\s|\n|\r)?>(\s|\n|\r)

on this document

< tag src="abc" testattribute >


<script > any script </script >

<tag2>what is this </tag2>
<tag>

I want the end result to be exactly this.

<tag src="abc" testattribute><script>any script</script><tag2>what is this</tag2><tag>

Upvotes: 0

Views: 73

Answers (1)

hwnd

Reputation: 70732

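You can simply use \s here to match whitespace; the (\s|\n|\r) alternation in your pattern is redundant because \s already covers newlines and carriage returns.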

\s matches whitespace (\n, \r, \t, \f, and " ")
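
As a quick check of that claim, here is a minimal Python sketch (the names `verbose`, `simple`, and `sample` are only illustrative). It compares the group from the question against a bare \s on a handful of characters; other regex flavors may differ slightly in what \s covers.

```python
import re

# The question's group (\s|\n|\r) and a bare \s accept the same characters,
# because \s already includes \n and \r.
verbose = re.compile(r"(\s|\n|\r)")
simple = re.compile(r"\s")

for ch in [" ", "\t", "\n", "\r", "\f", "a", "<"]:
    # Both patterns must agree on every probe character.
    assert bool(verbose.fullmatch(ch)) == bool(simple.fullmatch(ch))

sample = "a \t\n\r\fb"
print(simple.sub("", sample))  # prints: ab
```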

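If the regex flavor of your language supports lookaround assertions (lookbehind is not available everywhere), you can match only the whitespace that sits next to a tag bracket and replace it with an empty string: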

(?<=<|>)\s*|(?<!>|<)\s*(?![^><])

See live demo
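
For reference, a minimal sketch of applying the pattern in Python, assuming the sample document from the question. The names `TAG_WHITESPACE`, `document`, and `strip_tag_whitespace` are made up for this example, and other languages will need their own replace call.

```python
import re

# Pattern copied from the answer. Python's re accepts this lookbehind because
# both alternatives inside it are exactly one character wide.
TAG_WHITESPACE = re.compile(r"(?<=<|>)\s*|(?<!>|<)\s*(?![^><])")

# Sample document from the question.
document = (
    '< tag src="abc" testattribute >\n'
    "\n"
    "\n"
    "<script > any script </script >\n"
    "\n"
    "<tag2>what is this </tag2>\n"
    "<tag>"
)


def strip_tag_whitespace(html: str) -> str:
    """Remove whitespace next to '<' or '>' (or at the very end of the text)."""
    return TAG_WHITESPACE.sub("", html)


result = strip_tag_whitespace(document)
print(result)
# prints: <tag src="abc" testattribute><script>any script</script><tag2>what is this</tag2><tag>
```

Whitespace between words and between attributes is left alone, because neither side of it touches a bracket.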

Regular expression:

(?<=           look behind to see if there is:
 <             '<'
  |             OR
 >             '>'
)              end of look-behind
 \s*           whitespace (\n, \r, \t, \f, and " ") (0 or more times)
 |             OR
(?<!           look behind to see if there is not:
 >             '>'
  |            OR
 <             '<'
)              end of look-behind
 \s*           whitespace (\n, \r, \t, \f, and " ") (0 or more times)
 (?!           look ahead to see if there is not:
  [^><]        any character except: '>', '<'
 )             end of look-ahead
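
If your flavor has no lookbehind, a capture group may do the same job. This is a different technique from the pattern above, sketched here in Python with illustrative names: it consumes the whitespace on both sides of each bracket and writes back only the bracket. Unlike the lookaround version, it does not touch trailing whitespace that follows plain text at the end of the input.

```python
import re

# Lookbehind-free variant (not the answer's pattern): match a bracket together
# with any whitespace around it, then keep only the captured bracket.
BRACKET_WITH_PADDING = re.compile(r"\s*([<>])\s*")


def strip_around_brackets(html: str) -> str:
    """Collapse '  <  ' to '<' and '  >  ' to '>'."""
    return BRACKET_WITH_PADDING.sub(r"\1", html)


print(strip_around_brackets("<tag2>what is this </tag2>\n<tag>"))
# prints: <tag2>what is this</tag2><tag>
```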

Upvotes: 2
