OrangePot

Reputation: 1082

Java - Tokenizing by regex

I'm trying to tokenize strings of the following format:

"98, BA71V-CP204L (p32, p30), BA71V-CP204L (p32, p30), , 0, 125900, 126505"
"91, BA71V-B175L, BA71V-B175L, , 0, 108467, 108994,   -, 528, 528"

Each token will then be stored in a string array. The strings should be split on "," except for commas that sit inside ( , ), so that the contents of a ( , ) pair stay together in one token. A token may also consist of only a space.

I'm thinking the regex would find a comma, then check whether it is preceded by an opening parenthesis and followed by a closing parenthesis. A comma enclosed by ( ) in this way would not be used to tokenize.

I could write a regex for the opposite case, but what about when neither side of the delimiter contains "(" or ")"?

Currently I am using:

StringTokenizer tokaniza = new StringTokenizer(content,","); //no regex

but I feel as though a regex would work better with

content.split();

Upvotes: 0

Views: 92

Answers (2)

Avinash Raj

Reputation: 174706

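Use a negative lookahead assertion: split on every comma that is not followed by a closing parenthesis with no other parenthesis in between.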

import java.util.Arrays;

String s = "98, BA71V-CP204L (p32, p30), BA71V-CP204L (p32, p30), , 0, 125900, 126505";
String[] parts = s.split(",(?![^()]*\\))");
System.out.println(Arrays.toString(parts));

Output:

[98,  BA71V-CP204L (p32, p30),  BA71V-CP204L (p32, p30),  ,  0,  125900,  126505]
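If the leading spaces in the tokens are unwanted, the same split can be wrapped in a small helper. This is only a sketch: the class name `SplitOutsideParens` and the method `tokenize` are made up for illustration, it assumes the parentheses are never nested, and trimming turns the space-only token into an empty string, which may or may not be what you want.

```java
import java.util.Arrays;

class SplitOutsideParens {
    // A comma that is NOT followed by non-parenthesis characters and then ')'.
    private static final String COMMA_OUTSIDE_PARENS = ",(?![^()]*\\))";

    // Hypothetical helper: split on top-level commas, then trim each token.
    static String[] tokenize(String line) {
        // A limit of -1 keeps trailing empty tokens instead of dropping them.
        String[] parts = line.split(COMMA_OUTSIDE_PARENS, -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        String first = "98, BA71V-CP204L (p32, p30), BA71V-CP204L (p32, p30), , 0, 125900, 126505";
        String second = "91, BA71V-B175L, BA71V-B175L, , 0, 108467, 108994,   -, 528, 528";
        System.out.println(Arrays.toString(tokenize(first)));   // 7 tokens
        System.out.println(Arrays.toString(tokenize(second)));  // 10 tokens
    }
}
```

Drop the `trim()` loop if the original whitespace has to be preserved.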

Upvotes: 2

Culpepper

Reputation: 1111

Try a split using:

(?<!\(\w{1,4}),(?!\s*\w*\)).*?

One caveat: Java doesn't support unbounded repetition inside look-behinds, so you have to specify a maximum number of characters between the opening parenthesis and the comma (i.e. \w{1,4}). In other words, the look-behind stops recognizing the parenthesis once more than 4 word characters precede the comma inside it, and the split can then break.
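To see where the limit bites, here is a rough sketch that runs the pattern above (escaped for a Java string literal) on one input it handles and one it does not. The class name `BoundedLookbehindDemo` and the second input string are invented for the demonstration; they are not from the question.

```java
import java.util.Arrays;

class BoundedLookbehindDemo {
    // The pattern as posted, with backslashes doubled for a Java string literal.
    static final String POSTED_PATTERN = "(?<!\\(\\w{1,4}),(?!\\s*\\w*\\)).*?";

    static String[] split(String line) {
        return line.split(POSTED_PATTERN);
    }

    public static void main(String[] args) {
        // "p32" is 3 word characters, so the look-behind sees the "(" and
        // the inner comma is left alone: 3 tokens.
        String fits = "98, BA71V-CP204L (p32, p30), 0";
        System.out.println(Arrays.toString(split(fits)));

        // Made-up input: "p3200" is 5 word characters and the comma is not
        // directly followed by "word)", so neither assertion protects it.
        // The first inner comma is split on: 4 tokens instead of 3.
        String tooLong = "98, X (p3200, p30, p31), 0";
        System.out.println(Arrays.toString(split(tooLong)));
    }
}
```

Raising the upper bound in `\w{1,4}` moves the limit but does not remove it.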

Upvotes: 1
