Reputation: 67
I would like to replace all leading spaces and tabs on each line of an encoded XML/HTML string with HTML entities.
Every group of 4 spaces and every tab should be replaced by a tab entity (&#09;), and any remaining spaces by a non-breaking space entity (&nbsp;). The replacements must only happen at the start of each line, up to the first character that is not a space or tab.
Example
Beginning of line: (^|(\\r|\\n)+), where (\\r|\\n)+ lets several consecutive line breaks be matched together
Characters to replace: [ ], [\t]
21 spaces = 5 x &#09; + 1 x &nbsp;
10 spaces + 1 tab + 6 spaces = 2 x &#09; + 2 x &nbsp; + 1 x &#09; + 1 x &#09; + 2 x &nbsp;
:: 10 spaces = 2 x &#09; + 2 x &nbsp;
:: 1 tab = 1 x &#09;
:: 6 spaces = 1 x &#09; + 2 x &nbsp;
The input is a string that has already been processed by other regular-expression replacements:
text = text.replace(regex1, replacement1)
text = text.replace(regex2, replacement2)
text = text.replace(regex3, replacement3)
text = text.replace(regex4, replacement4)
At this point I need to apply the new regular expression.
Visual XML
<TEST>
    <NODE1>
         <VALUE> Test</VALUE>
    </NODE1>
    <NODE1>
         <VALUE> Test</VALUE>
    </NODE1>
</TEST>
Encoded XML structure of the visual XML above (this is the input string)
&lt;TEST&gt;
    &lt;NODE1&gt;
         &lt;VALUE&gt; Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
    &lt;NODE1&gt;
         &lt;VALUE&gt; Test&lt;/VALUE&gt;
    &lt;/NODE1&gt;
&lt;/TEST&gt;
Expected output
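&lt;TEST&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt; Test&lt;/VALUE&gt; <- the space inside <VALUE> is NOT replaced
&#09;&lt;/NODE1&gt;
&#09;&lt;NODE1&gt;
&#09;&#09;&nbsp;&lt;VALUE&gt; Test&lt;/VALUE&gt; <- the space inside <VALUE> is NOT replaced
&#09;&lt;/NODE1&gt;
&lt;/TEST&gt;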
I have tried a lot:
I tried to store the beginning of the line in a capture group and replace the groups of whitespace, but failed.
Result: the beginning of the line is repeated together with the HTML-encoded spaces/tabs
Example: \r&#09;\r&#09;\r&#09;\r&#09;
Expected: \r&#09;&#09;&#09;&#09;
"(^|(\\r|\\n))[ ]{4}", "\\1&#09;"
I tried to do this in two steps: first replace groups of 4 spaces and tabs with &#09;, then replace the remaining spaces with &nbsp;, but the second step replaces every space, not only the leading ones.
I tried the same with "&#09;[ ]", "&#09;&nbsp;"
I also tried a Matcher.find() loop with substring, which gave the best results so far, but they are still not 100% correct.
I keep failing to get the correct regex. Can anyone help?
Upvotes: 1
Views: 423
Reputation: 22978
How about the following program, which uses a chain of replaceAll calls and lookbehinds:
public class Main {
    public static void main(String[] args) {
        final String[] INPUT = new String[] {
            "<TEST>",
            "    <NODE1>",
            "\t\t <VALUE> Test</VALUE>", // 2 tabs 1 space here
            "    </NODE1>",
            "    <NODE1>",
            "        <VALUE> Test</VALUE>", // 8 spaces here
            "    </NODE1>",
            "</TEST>"
        };
        for (String str : INPUT) {
            System.out.println("NEW: " + htmlspecialchars(str));
        }
    }

    // The lookbehinds use a bounded {0,99} instead of *, because unbounded
    // lookbehinds are not reliably supported across Java versions.
    private static String htmlspecialchars(String str) {
        return str
            .replaceAll("&", "&amp;") // replace html entities ('&' first to avoid double escaping)
            .replaceAll("<", "&lt;")
            .replaceAll(">", "&gt;")
            .replaceAll("(?<=^[ \\t]{0,99})\\t", "    ") // replace leading tabs by 4 spaces
            .replaceAll("(?<=^ {0,99}) {4}", "&#09;") // replace leading groups of 4 spaces by &#09;
            .replaceAll("(?<=^(?:&#09;){0,99} {0,3}) ", "&nbsp;"); // replace rest spaces by &nbsp;
    }
}
The resulting output is:
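NEW: &lt;TEST&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&nbsp;&lt;VALUE&gt; Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &#09;&lt;NODE1&gt;
NEW: &#09;&#09;&lt;VALUE&gt; Test&lt;/VALUE&gt;
NEW: &#09;&lt;/NODE1&gt;
NEW: &lt;/TEST&gt;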
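If the text arrives as one multi-line string that is already entity-encoded, a Matcher loop might be easier to reason about than chained lookbehinds. The following is only a minimal sketch under those assumptions: it follows the per-run rule from the question (a tab closes the current run of spaces, so leftover spaces before it become &nbsp;), which is not the same as treating a tab as 4 spaces. The names LeadingWhitespaceSketch, encodeLeading and encodeRun are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingWhitespaceSketch {
    // (?m) lets ^ match at the start of every line, not only at the start of the input.
    private static final Pattern LEADING = Pattern.compile("(?m)^[ \\t]+");

    // Encodes the leading spaces/tabs of every line and leaves the rest untouched.
    static String encodeLeading(String text) {
        Matcher m = LEADING.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(encodeRun(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Encodes one run consisting only of spaces and tabs:
    // 4 spaces -> &#09;, tab -> &#09;, leftover spaces -> &nbsp;
    static String encodeRun(String run) {
        StringBuilder sb = new StringBuilder();
        int pending = 0; // spaces seen since the last emitted &#09;
        for (char c : run.toCharArray()) {
            if (c == ' ') {
                pending++;
                if (pending == 4) {
                    sb.append("&#09;");
                    pending = 0;
                }
            } else { // a tab: flush leftover spaces first, then emit the tab entity
                for (; pending > 0; pending--) {
                    sb.append("&nbsp;");
                }
                sb.append("&#09;");
            }
        }
        for (; pending > 0; pending--) {
            sb.append("&nbsp;");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "&lt;TEST&gt;\n    &lt;NODE1&gt;\n\t\t &lt;VALUE&gt; Test&lt;/VALUE&gt;";
        System.out.println(encodeLeading(input));
    }
}
```

With this sketch, the "10 spaces + 1 tab + 6 spaces" case from the question yields 2 x &#09;, 2 x &nbsp;, 2 x &#09; and 2 x &nbsp;, and the space inside the VALUE element is left alone because it does not sit at the start of a line.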
Upvotes: 1