Reputation: 132
I am removing control characters from a string as I load and deserialise it. I do this with the following regex, which is fine:
\\p{C}
The issue is that part of the text is meant to contain new lines. So what I need to do is remove all control characters unless they fall between <Text> and </Text>.
How can I do this with a regex?
Upvotes: 2
Views: 161
Reputation: 508
Here is a string I have to test regex patterns that remove control characters.
AAU?Aasddsaustw3h,kdf134dfswdesdfent?�sdfsadfa45678r?w3h,kdf134dfswdesdfawh,kdf134dfswdesdfsurew3h,kdf134dfswdesdfent??3asdfliit/123423defwecty ?�STasd?Pawh,kdf134dfswdesdfks?Hw3rsdfsd134dfswdet
The regex pattern "[[:cntrl:]]" seems to work well.
string.replaceAll("[\u0000-\u001f]", "") replaces only some of them.
"\p{Cntrl}" only replaces what looks like an empty string after "wecty".
Can anyone tell me what those control characters are? I can replace them, but I cannot figure out what they are. The online Java regex tester shows 11 control characters matched: https://www.freeformatter.com/java-regex-tester.html#ad-output
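One way to find out what the characters are is to print the code point of every character that falls in a Unicode "Other" category. The sketch below is only an illustration: the class name ListControlChars and the method describe are made up for this example. It checks the same general categories that \p{C} covers (control, format, surrogate, private use, unassigned). Note that in Java \p{Cntrl} is the POSIX class for ASCII control characters only ([\x00-\x1F\x7F]), which may be why it matches fewer characters than expected.

```java
import java.util.ArrayList;
import java.util.List;

public class ListControlChars {
    // Hypothetical helper: returns "U+XXXX" for every code point in a
    // Unicode "Other" category, i.e. the categories matched by \p{C}.
    static List<String> describe(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.CONTROL
                    || type == Character.FORMAT
                    || type == Character.SURROGATE
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Sample input with a BEL (U+0007) and a zero-width space (U+200B).
        System.out.println(describe("a\u0007b\u200Bc"));
    }
}
```

Running describe on the real test string should list the actual code points, which can then be looked up in a Unicode chart.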
Upvotes: 0
Reputation: 9644
You could use
replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1")
The idea is to match whole Text elements first and leave them alone (replace them with themselves). So if we then encounter a \\p{C}, we know it's not inside one.
Explanation:
(?s) activates "dot matches all" mode, so . will match newlines as well.
(<Text>.*?</Text>) captures the text node in the first group. We replace it with the result of this capture through $1.
If we match \\p{C} instead, this means we are not in a Text node. So we replace with $1, which is empty since (<Text>.*?</Text>) didn't match in the alternation.
Ideone illustration: http://ideone.com/xKZgsn
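A minimal, self-contained sketch of the replacement described above. The class name StripControlOutsideText, the method strip and the sample input are assumptions made for illustration; only the pattern and the "$1" replacement come from the answer.

```java
import java.util.regex.Pattern;

public class StripControlOutsideText {
    // Branch 1 captures a whole <Text>...</Text> element in group 1.
    // Branch 2 matches a single \p{C} character anywhere else.
    private static final Pattern PATTERN =
            Pattern.compile("(?s)(<Text>.*?</Text>)|\\p{C}");

    static String strip(String input) {
        // "$1" puts a matched Text element back unchanged; when branch 2
        // matched, group 1 did not participate and nothing is appended.
        return PATTERN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // BEL, a newline and a tab sit outside the Text element;
        // one newline sits inside it.
        String input = "a\u0007b\n<Text>line1\nline2</Text>\tc";
        System.out.println(strip(input));
    }
}
```

With this input, the control characters outside the Text element are removed and the newline between line1 and line2 is kept. The sketch assumes Text elements are not nested and have no attributes, since the pattern only matches a literal <Text> opening tag.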
Upvotes: 3
Reputation: 68790
You could use this regex:
/(?!<text[^>]*?>)(\p{C}+)(?![^<]*?<\/text>)/gi
But, as mentioned by @fge, it would be better to parse your input properly.
Upvotes: 0