dustydojo

Reputation: 719

In Java, how do you tokenize a string that contains the delimiter in the tokens?

Let's say I have the string:

String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";

I want the tokens:

  1. prop1=value1
  2. prop2=String test='1234';int i=4;
  3. prop3=value3

For backwards compatibility, I have to use the semicolon as the delimiter. I have tried wrapping the code value in something like CDATA:

String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";

But I can't figure out a regular expression that ignores the semicolons inside the CDATA section.

I've also tried escaping the semicolons that are not delimiters:

String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";

But then there is an ugly mess of removing the escape characters.

Do you have any suggestions?

Upvotes: 1

Views: 98

Answers (2)

Bentaye

Reputation: 9766

Prerequisites:

  • All your tokens start with prop

  • prop does not appear anywhere in the string except at the beginning of a token

I'd just replace every ;prop with ~prop

Then your string becomes:

"prop1=value1~prop2=String test='1234';int i=4;~prop3=value3"

You can then tokenize using the ~ delimiter (this also assumes ~ never occurs inside a value)
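
The replace-then-split idea above can be sketched as follows. This is only a minimal illustration under the stated prerequisites (every token starts with prop, and ~ never occurs in the input); the class name PropSplit and the method tokenize are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

public class PropSplit {
    // Assumption: every token starts with "prop" and '~' never appears in the input.
    static List<String> tokenize(String input) {
        // Turn only the semicolons that directly precede a token into '~'.
        String marked = input.replace(";prop", "~prop");
        // '~' has no special meaning in a regex, so it can be passed to split as-is.
        return Arrays.asList(marked.split("~"));
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
        for (String token : tokenize(toTokenize)) {
            System.out.println(token);
        }
    }
}
```

For the string from the question, this prints prop1=value1, then prop2=String test='1234';int i=4; and then prop3=value3, each on its own line.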

Upvotes: 0

Wiktor Stribiżew

Reputation: 627488

You can match the values as either a <![CDATA[...]]> section or any char other than ;, repeated one or more times. To match the keys, a regular \w+ pattern is enough:

(\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)

See the regex demo.

Details

  • (\w+) - Group 1: one or more word chars
  • = - a = sign
  • ((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more occurrences of
    • <!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring
    • | - or
    • [^;] - any char but ;

See a Java demo:

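import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the key; Group 2 is the value (a CDATA section or any run of non-';' chars)
String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
// Each find() call advances to the next key=value pair
while (matcher.find()) {
    System.out.println(matcher.group(1) + " => " + matcher.group(2));
}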

Results:

prop1 => value1
prop2 => <![CDATA[String test='1234';int i=4;]]>
prop3 => value3
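
If you need the plain tokens from the question rather than the CDATA-wrapped values, one option is to strip the wrapper from Group 2 after matching. The following is a minimal sketch, not the only way to do it; the class name CdataTokenizer and the method tokenize are hypothetical, and it assumes a CDATA section never spans multiple lines (otherwise Pattern.DOTALL would be needed for the .*? part).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CdataTokenizer {
    // Same key=value pattern as in the demo above.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)");
    // Captures the text inside a single CDATA section.
    private static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[(.*?)]]>");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Replace each CDATA section in the value with its inner text.
            String value = CDATA.matcher(m.group(2)).replaceAll("$1");
            tokens.add(m.group(1) + "=" + value);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
        tokenize(s).forEach(System.out::println);
    }
}
```

With the sample string, the second token comes out as prop2=String test='1234';int i=4; with the semicolons inside the former CDATA section preserved.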

Upvotes: 1
