How do you use regex to split a string by a unicode character?

Question

I need help using regular expressions. I have read the Java Regex notes, but could not find a way around my problem.

PROBLEM: I have a String that needs to be split at every occurrence of the Unicode characters \0, \1 and \2.

TRIED:

String msg = "foo\0foo\0bar\2foo\1horse";
msg.split("[\1\0\2]");

The above works perfectly (not sure if it is the correct use of regex), but

String msg = "foo\0foo\0bar\2foo\1horse\1123123\0123123\21";
msg.split("[\1\0\2]");

does not work correctly, as it seems the regex is picking up \1k (where k is any run of digits) instead of JUST \0, \1 and \2.

Any thoughts?

SOLVED: The issue was in my test, where I used a hand-written String literal. Writing \1 directly before digits made Java read \1k as a single octal escape rather than \1 followed by the digits. My real source data arrives as bytes, so it already has the correct \1 encoded. I introduced the error when I decoded and re-encoded it manually. Working with the raw data solved the problem.

Alternatively, I used the Unicode escapes \u0001-\u0002 to re-encode, and that worked as well. Thanks for all the answers. I learnt a bit about regex and Unicode.
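
For anyone hitting the same thing, here is a minimal sketch of the pitfall. The class and method names are made up for illustration. It relies on Java's rule that an octal escape starting with a digit from 0 to 3 may take up to three octal digits, so \1 followed by "123" is consumed as \112 (the letter J) plus a literal "3".

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class OctalEscapeDemo {

    // Splits on the control characters U+0000, U+0001 and U+0002.
    static String[] splitOnControls(String s) {
        return s.split("[\u0000-\u0002]");
    }

    public static void main(String[] args) {
        // Octal escape: the compiler reads \112 (the letter 'J'), then "3".
        String octal = "horse\1123";
        // Unicode escape: exactly four hex digits, so this is U+0001, then "123".
        String unicode = "horse\u0001123";

        System.out.println(octal.length());   // 7
        System.out.println(unicode.length()); // 9

        System.out.println(Arrays.toString(splitOnControls(octal)));   // [horseJ3]
        System.out.println(Arrays.toString(splitOnControls(unicode))); // [horse, 123]
    }
}
```

The first string contains no separator at all, which is why the split appeared to fail on my hand-written test data.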

maerics · Accepted Answer

Try using the Unicode character literal form (\uXXXX):

String msg ="foo\u0000bar\u0001gah\u0002zip\u0001horse\u0001123123\u0000456456\u00021";
String[] ss = msg.split("[\u0000-\u0002]");
// ss = ["foo", "bar", "gah", "zip", "horse", "123123", "456456", "1"];
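
A variation you could also try, sketched here with made-up class and method names: let the regex engine interpret the escape instead of the Java compiler. java.util.regex.Pattern understands \xhh, so doubling the backslash in the Java literal passes the escape through to the pattern. This is an alternative to the snippet above, not a correction of it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class ControlSplitDemo {

    // The doubled backslashes mean the regex engine, not the compiler,
    // resolves the hex escapes for code points 0 through 2.
    private static final Pattern CONTROLS = Pattern.compile("[\\x00-\\x02]");

    static String[] splitOnControls(String s) {
        return CONTROLS.split(s);
    }

    public static void main(String[] args) {
        String msg = "foo\u0000bar\u0002baz\u0001123";
        System.out.println(Arrays.toString(splitOnControls(msg))); // [foo, bar, baz, 123]
    }
}
```

Compiling the pattern once is only a convenience if you split many messages. Note that the input string still needs \uXXXX escapes (or raw data) for the separators, since the regex escape does nothing for the octal ambiguity in the message literal itself.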
