NeverImageProcess

Reputation: 33

Java - Regex Replace All will not replace matched text

I'm trying to remove a number of Unicode escape sequences from a string, but I'm having trouble with the regex in Java.

Example text:

\u2605 StatTrak\u2122 Shadow Daggers

Example Desired Result:

StatTrak Shadow Daggers

The current regex code I have that will not work:

list.replaceAll("\\\\u[0-9]+","");

The code runs, but the text is not replaced. Other solutions seem to use only two backslashes ("\\"), but anything fewer than four gives me this error:

Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2 \u[0-9]+

I've tried the same regex in online testers such as RegexPlanet and FreeFormatter, and both give the correct result.

Any help would be appreciated.

Upvotes: 1

Views: 300

Answers (3)

NeverImageProcess

Reputation: 33

I'm an idiot. I was calling replaceAll on the string but not assigning the result, because I thought it modified the string in place.

What I had previously:

list.replaceAll("\\\\u[0-9]+","");

What I needed:

list = list.replaceAll("\\\\u[0-9]+","");

It works fine now, thanks for the help.
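For anyone landing here later, a minimal sketch of the difference. It assumes the input really contains the six-character text \u2605 (a backslash, a u, and four digits), which is what the fix above implies. The class and method names are made up for illustration.

```java
public class ReplaceAllAssignment {
    // Removes escape-like text (backslash, 'u', digits) that is literally
    // present in the input, then trims the space left at the start.
    static String stripLiteralEscapes(String input) {
        // The regex engine receives two backslashes followed by u[0-9]+,
        // which matches one literal backslash, a 'u', and one or more digits.
        return input.replaceAll("\\\\u[0-9]+", "").trim();
    }

    public static void main(String[] args) {
        // Each doubled backslash in source is one real backslash at runtime.
        String list = "\\u2605 StatTrak\\u2122 Shadow Daggers";

        // Strings are immutable: this call returns a new String that is
        // thrown away, so list still holds the original text.
        list.replaceAll("\\\\u[0-9]+", "");
        System.out.println(list);

        // Assigning the return value keeps the cleaned text.
        list = stripLiteralEscapes(list);
        System.out.println(list);
    }
}
```

Note that [0-9]+ only covers escapes made of decimal digits. If the data can contain hex digits, something like [0-9A-Fa-f]{4} would be needed instead.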

Upvotes: 0

Stephen P

Reputation: 14800

In Java source code, something like your \u2605 inside a string literal is not a literal sequence of six characters; it represents a single Unicode character, so your pattern "\\\\u[0-9]+" will not match it.

Your pattern describes a literal \ character followed by the character u followed by one or more digits 0 through 9, but what is in your string is the single character at Unicode code point U+2605, the "Black Star" character.

This is just like other escape sequences: in the string "some\tmore" there is no \ character and no t character; there is only the single character 0x09, a tab. Because \t is an escape sequence known to Java (and other languages), it is replaced by the character it represents, and the literal \ and t are no longer characters in the string.

Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed. Alternatively, if the set of characters you want to keep is very limited, you could list those and remove everything else, such as mystring.replaceAll("[^A-Za-z0-9]", ""); (add a space inside the brackets if you want to keep the spaces between words).
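A small sketch of the distinction, assuming the text comes from a Java string literal. The class name, constant, and helper method are hypothetical; only String.length and String.replaceAll are standard.

```java
public class EscapeVsLiteral {
    // The pattern from the question: literal backslash, 'u', then digits.
    static final String ESCAPE_PATTERN = "\\\\u[0-9]+";

    // Keeps only ASCII letters, digits, and spaces, then trims the ends.
    static String keepAllowed(String input) {
        return input.replaceAll("[^A-Za-z0-9 ]", "").trim();
    }

    public static void main(String[] args) {
        String single = "\u2605";    // the compiler turns this into one character
        String literal = "\\u2605";  // backslash + 'u' + four digits: six characters

        System.out.println(single.length());   // 1
        System.out.println(literal.length());  // 6

        // The pattern finds nothing in the one-character string...
        System.out.println(single.replaceAll(ESCAPE_PATTERN, "").length());   // 1
        // ...but removes the six-character text entirely.
        System.out.println(literal.replaceAll(ESCAPE_PATTERN, "").length());  // 0

        // Removing the complement of an allowed set works on the real characters.
        System.out.println(keepAllowed("\u2605 StatTrak\u2122 Shadow Daggers"));
    }
}
```

Which of the two cases applies depends on where the string comes from: a literal typed in source code gives the single character, while text read from a file or API response may contain the six-character form.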

Upvotes: 1

Kenny Tai Huynh

Reputation: 1599

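I'm assuming you want to replace these "special" characters with an empty String. \u2605 and \u2122 fall outside the POSIX character class \p{Print}, which by default covers only printable ASCII characters. That means we can match everything that is not in that class with \P{Print} and replace it with "". The result then matches what you expect, apart from a leading space that you can trim.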

Sample would be:

list = list.replaceAll("\\P{Print}", "");

Hope this helps.
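A runnable sketch of the same idea, assuming the string holds the real Unicode characters rather than escape text. The class and method names are made up for the example.

```java
public class StripNonPrintable {
    // Drops every character outside the default (ASCII-only) POSIX Print
    // class, then trims the space left behind at the start.
    static String stripNonAsciiPrintable(String input) {
        return input.replaceAll("\\P{Print}", "").trim();
    }

    public static void main(String[] args) {
        String list = "\u2605 StatTrak\u2122 Shadow Daggers";
        list = stripNonAsciiPrintable(list);
        System.out.println(list);
    }
}
```

Be aware that this also removes any other non-ASCII character, such as accented letters, so it only fits if plain ASCII output is acceptable.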

Upvotes: 1
