REGEX replace leading spaces and tabs to html code in encoded xml per line, in Java

Question

I would like to replace all leading spaces and tabs on each line of an encoded XML/HTML string with HTML entities.

- replace every group of 4 spaces, and every tab, with a tab entity (&#09;)
- replace the remaining spaces with a non-breaking space entity (&nbsp;)
- the replacements must happen only at the start of each line, up to the first character that is not a space or tab

Example

Beginning of line: (^|(\r|\n)+), where (\r|\n)+ means that multiple consecutive line breaks can be matched together

Characters to replace: [ ] (space), [\t] (tab)

21 spaces = 5 x &#09; + 1 x &nbsp;
10 spaces + 1 tab + 6 spaces = 2 x &#09; + 2 x &nbsp; + 1 x &#09; + 1 x &#09; + 2 x &nbsp;

:: 10 spaces = 2 x &#09; + 2 x &nbsp;
:: 1 tab = 1 x &#09;
:: 6 spaces = 1 x &#09; + 2 x &nbsp;
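
The arithmetic above can be sketched without any regex, which may help when checking what a regex solution should produce. This is only an illustration under assumptions: the class name `LeadingWhitespaceEncoder` and the helpers `encodeLeading` and `spaces` are hypothetical. It follows the breakdown in the example, converting each run of spaces on its own (run / 4 tab entities, run % 4 non-breaking spaces) and each tab into one tab entity.

```java
public class LeadingWhitespaceEncoder {
    static final String TAB = "&#09;";
    static final String NBSP = "&nbsp;";

    /** Hypothetical helper: builds a string of n spaces. */
    static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(' ');
        return sb.toString();
    }

    /** Encodes only the leading spaces and tabs of a single line. */
    static String encodeLeading(String line) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '\t') {
                out.append(TAB);              // every tab becomes one tab entity
                i++;
            } else if (c == ' ') {
                int start = i;
                while (i < line.length() && line.charAt(i) == ' ') i++;
                int run = i - start;          // length of this run of spaces
                for (int k = 0; k < run / 4; k++) out.append(TAB);
                for (int k = 0; k < run % 4; k++) out.append(NBSP);
            } else {
                break;                        // first non-space, non-tab character
            }
        }
        return out.append(line.substring(i)).toString();
    }

    public static void main(String[] args) {
        // 21 spaces: 5 x &#09; + 1 x &nbsp;
        System.out.println(encodeLeading(spaces(21) + "x"));
        // 10 spaces + 1 tab + 6 spaces: 2 x &#09; + 2 x &nbsp; + 1 x &#09; + 1 x &#09; + 2 x &nbsp;
        System.out.println(encodeLeading(spaces(10) + "\t" + spaces(6) + "x"));
    }
}
```

Spaces after the first non-whitespace character are left alone, so `encodeLeading("  a b")` only touches the two leading spaces.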

The input is a string that has already been processed by other regular-expression replacements:

text = text.replace(regex1, replacement1)
text = text.replace(regex2, replacement2)
text = text.replace(regex3, replacement3)
text = text.replace(regex4, replacement4)

At this point I need to add the new regular expression.

Visual XML

Encoded XML structure, shown here in its visual (decoded) form; this is the input string:

<TEST>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
    <NODE1>
        <VALUE>         Test</VALUE>
    </NODE1>
</TEST>

Expected output

<TEST>
&#09;<NODE1>
&#09;&#09;&nbsp;<VALUE>         Test</VALUE> <- NOT replaced in 
&#09;</NODE1>
&#09;<NODE1>
&#09;&#09;&nbsp;<VALUE>         Test</VALUE> <- NOT replaced in 
&#09;</NODE1>
</TEST>

I have tried a lot.

I tried, and failed, to capture the beginning of the line in a regex group and then replace the groups of whitespace.

Result: the beginning of the line is repeated along with the HTML-encoded spaces/tabs.
example:
&#09;
&#09;
&#09;
&#09;
expected:
&#09;&#09;&#09;&#09;

"(^|(\r|\n))[ ]{4}", "\1&#09;"

I tried to do this in two steps: first replace groups of 4 spaces with tabs (and tabs with tabs), then replace the remaining spaces with &nbsp;. But then every space is replaced, not only the leading ones. I got the same result with " [ ]", "&nbsp;".

I also tried a Matcher.find() loop with substring, which gave the best results so far, but they are still not 100% correct.

I keep failing to get the correct regex. Can anyone help?

Alexander Farber · Accepted Answer

How about the following program, which uses a bunch of replaceAll calls and lookbehinds:

    public static void main (String[] args) {
        final String[] INPUT = new String[] {
            "<TEST>",
            "    <NODE1>",
            "         <VALUE>         Test</VALUE>",   // 9 leading spaces: 2 tabs 1 space here
            "    </NODE1>",
            "    <NODE1>",
            "        <VALUE>         Test</VALUE>",
            "    </NODE1>",
            "</TEST>"
        };

        for (String str: INPUT) {
            System.out.println("NEW: " + htmlspecialchars(str));
        }
    }

    private static String htmlspecialchars(String str) {
        return str
            .replaceAll("&", "&amp;")                      // replace html entities
            .replaceAll("<", "&lt;")
            .replaceAll(">", "&gt;")
            .replaceAll("(?<=^\\s*)\t", "    ")            // replace tabs by 4 spaces
            .replaceAll("(?<=^\\s*)    ", "&#09;")         // replace 4 spaces by &#09;
            .replaceAll("(?<=^(?:&#09;)*) ", "&nbsp;");    // replace rest spaces by &nbsp;
    }

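The resulting output is:

NEW: &lt;TEST&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&nbsp;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&lt;VALUE&gt;         Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &lt;/TEST&gt;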
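
Since the question mentions a Matcher.find() loop and a whole text that is processed in one go, here is a hedged sketch of that variant. It is not the code from the answer: the class name `IndentEncoder` and the method `encodeIndent` are hypothetical, and it assumes, like the answer, that a tab counts as 4 columns. It matches the leading whitespace of every line with `(?m)^[ \t]+` and builds the replacement in Java, so it does not depend on how the regex engine treats unbounded quantifiers inside a lookbehind, and it encodes all leftover spaces when more than one remains.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndentEncoder {
    // (?m) lets ^ match at the start of every line, not only at the start of the input
    private static final Pattern LEADING = Pattern.compile("(?m)^[ \\t]+");

    /** Encodes the leading spaces and tabs of every line in text. */
    static String encodeIndent(String text) {
        Matcher m = LEADING.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: a tab is worth 4 columns, a space 1 column
            int width = 0;
            for (char c : m.group().toCharArray()) {
                width += (c == '\t') ? 4 : 1;
            }
            StringBuilder rep = new StringBuilder();
            for (int i = 0; i < width / 4; i++) rep.append("&#09;");
            for (int i = 0; i < width % 4; i++) rep.append("&nbsp;");
            // quoteReplacement keeps the replacement literal ($ and \ are not interpreted)
            m.appendReplacement(out, Matcher.quoteReplacement(rep.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "&lt;TEST&gt;\n"
                    + "    &lt;NODE1&gt;\n"
                    + "         &lt;VALUE&gt;   Test&lt;/VALUE&gt;\n"
                    + "    &lt;/NODE1&gt;\n"
                    + "&lt;/TEST&gt;\n";
        System.out.print(encodeIndent(text));
    }
}
```

Because `[ \t]+` never matches a line break, empty lines are left untouched, and whitespace after the first other character on a line is never part of a match. In the chain from the question, a call such as `text = IndentEncoder.encodeIndent(text)` would go after the existing replacements.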
