Hans.Olo

Reputation: 67

Regex: replace leading spaces and tabs with HTML entities, per line, in encoded XML (Java)

I would like to replace all leading spaces and tabs on each line of an encoded XML/HTML string with HTML entities.

Every group of 4 spaces and every tab should become a tab entity (&#09;); each remaining space should become a non-breaking space (&nbsp;). The replacement must happen only at the start of each line, up to the first character that is not a space or tab.

Example

Start of line: (^|(\\r|\\n)+) => (\\r|\\n)+ so that several consecutive line breaks are matched together

Characters to replace: [ ], [\t]

21 spaces = 5 x &#09; + 1 x &nbsp;
10 spaces + 1 tab + 6 spaces = 2 x &#09; + 2 x &nbsp; + 1 x &#09; + 1 x &#09; + 2 x &nbsp;

:: 10 spaces = 2 x &#09; + 2 x &nbsp;
:: 1 tab = 1 x &#09;
:: 6 spaces = 1 x &#09; + 2 x &nbsp;
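
Read literally, the rule works on runs of consecutive spaces: each full group of 4 becomes one &#09;, each leftover space becomes one &nbsp;, and a tab ends the current run and becomes one &#09;. The following is only a sketch of that arithmetic. It assumes the leading whitespace has already been cut off the line, and the names IndentArithmetic, encodePrefix, flushSpaces and spaces are made up for this illustration.

```java
class IndentArithmetic {

    // Hypothetical helper: encodes a prefix consisting only of spaces and tabs.
    static String encodePrefix(String prefix) {
        StringBuilder out = new StringBuilder();
        int run = 0; // length of the current run of consecutive spaces
        for (int i = 0; i < prefix.length(); i++) {
            char c = prefix.charAt(i);
            if (c == ' ') {
                run++;
            } else if (c == '\t') {
                flushSpaces(out, run); // a tab ends the current run of spaces
                run = 0;
                out.append("&#09;");
            } else {
                throw new IllegalArgumentException("not a space or tab at index " + i);
            }
        }
        flushSpaces(out, run);
        return out.toString();
    }

    // Each full group of 4 spaces becomes &#09;, each leftover space becomes &nbsp;.
    private static void flushSpaces(StringBuilder out, int run) {
        for (int i = 0; i < run / 4; i++) out.append("&#09;");
        for (int i = 0; i < run % 4; i++) out.append("&nbsp;");
    }

    // Builds a string of n spaces (String.repeat would need Java 11+).
    static String spaces(int n) {
        char[] chars = new char[n];
        java.util.Arrays.fill(chars, ' ');
        return new String(chars);
    }

    public static void main(String[] args) {
        // prints &#09;&#09;&#09;&#09;&#09;&nbsp;
        System.out.println(encodePrefix(spaces(21)));
        // prints &#09;&#09;&nbsp;&nbsp;&#09;&#09;&nbsp;&nbsp;
        System.out.println(encodePrefix(spaces(10) + "\t" + spaces(6)));
    }
}
```

Both printed lines match the two worked examples above.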

The input is a string that has already been processed by other regular expressions:

text = text.replaceAll(regex1, replacement1);
text = text.replaceAll(regex2, replacement2);
text = text.replaceAll(regex3, replacement3);
text = text.replaceAll(regex4, replacement4);

The new regular expression has to be applied at this point.

Visual XML

<TEST>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
</TEST>

Encoded XML of the structure above, which is the input string

&lt;TEST&gt;
    &lt;NODE1&gt;
        &lt;VALUE&gt;         Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
    &lt;NODE1&gt;
        &lt;VALUE&gt;         Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
&lt;/TEST&gt;

Expected output

&lt;TEST&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt; <- NOT replaced in <VALUE>
&#09;&lt;/NODE1&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt; <- NOT replaced in <VALUE>
&#09;&lt;/NODE1&gt;
&lt;/TEST&gt;

I have tried a lot.

I tried to capture the start of the line in a regex group and replace the groups of spaces, but that failed.

Result: the start of the line is repeated along with the HTML-encoded spaces/tabs
Example: \r&#09;\r&#09;\r&#09;\r&#09;
Expected: \r&#09;&#09;&#09;&#09;

"(^|(\\r|\\n))[ ]{4}", "\\1&#09"

I tried to do this in two steps: first replace groups of 4 spaces and tabs with &#09;, then replace the remaining spaces with &nbsp;. The second step, however, replaces every space in the text. I tried the same with "&#09;[ ]", "&#09;&nbsp;".

I also tried a Matcher.find() loop with substring, which gave the best results so far, but they are still not 100% correct.

I keep failing to get the regex right. Can anyone help?

Upvotes: 1

Views: 423

Answers (1)

Alexander Farber

Reputation: 22978

How about the following program, which chains several replaceAll calls with lookbehinds:

    public class Main {                                     // class name is arbitrary

        public static void main(String[] args) {
            final String[] INPUT = new String[] {
                "<TEST>",
                "    <NODE1>",
                "         <VALUE>         Test</VALUE>",    // 9 spaces: 2 tabs + 1 space
                "    </NODE1>",
                "    <NODE1>",
                "        <VALUE>         Test</VALUE>",     // 8 spaces: 2 tabs
                "    </NODE1>",
                "</TEST>"
            };

            for (String str : INPUT) {
                System.out.println("NEW: " + htmlspecialchars(str));
            }
        }

        private static String htmlspecialchars(String str) {
            return str
                .replaceAll("&", "&amp;")                   // escape & first, before other entities are added
                .replaceAll("<", "&lt;")
                .replaceAll(">", "&gt;")
                .replaceAll("(?<=^\\s*)\t", "    ")         // replace leading tabs by 4 spaces
                .replaceAll("(?<=^\\s*)    ", "&#09;")      // replace leading groups of 4 spaces by &#09;
                .replaceAll("(?<=^(?:&#09;)*) ", "&nbsp;"); // replace a leftover leading space by &nbsp;
        }
    }

The resulting output is:

NEW: &lt;TEST&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &lt;/TEST&gt;
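
As a possible alternative for the already-encoded, multi-line input string from the question, here is a minimal single-pass sketch that avoids lookbehinds altogether. It is not the program above: it relies on \G (end of the previous match) to keep matching only while the indentation continues, and on a Matcher.find() loop to pick the entity per match. The names LeadingWhitespaceRegex, LEADING and encodeIndent are invented for this example, and it assumes the &, < and > escaping has already been done.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingWhitespaceRegex {

    // ^ starts a chain at every line start (MULTILINE); \G continues the chain
    // only directly after the previous match, so nothing past the indentation matches.
    // Group 1 is set for "4 spaces" or "1 tab"; a single leftover space leaves it null.
    private static final Pattern LEADING =
        Pattern.compile("(?:^|\\G)(?:( {4}|\\t)| )", Pattern.MULTILINE);

    // Hypothetical helper: encodes the leading whitespace of every line in text.
    static String encodeIndent(String text) {
        Matcher m = LEADING.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String entity = (m.group(1) != null) ? "&#09;" : "&nbsp;";
            m.appendReplacement(out, Matcher.quoteReplacement(entity));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "&lt;TEST&gt;\r\n"
                     + "    &lt;NODE1&gt;\r\n"
                     + "         &lt;VALUE&gt;         Test&lt;/VALUE&gt;\r\n"
                     + "    &lt;/NODE1&gt;\r\n"
                     + "&lt;/TEST&gt;";
        System.out.println(encodeIndent(input));
    }
}
```

Because 4 spaces are tried before a single space, a run such as 10 spaces becomes 2 x &#09; followed by 2 x &nbsp;, and the spaces inside the VALUE element stay untouched because neither ^ nor \G matches there.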

Upvotes: 1
