Reputation: 719
Let's say I have the string:
String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
I want the tokens:
prop1=value1
prop2=String test='1234';int i=4;
prop3=value3
For backwards compatibility, I have to use the semicolon as a delimiter. I have tried wrapping code in something like CDATA:
String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
But I can't figure out a regular expression to ignore the semicolons that are within the cdata tags.
I've tried escaping the non-delimiter:
String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";
But then there is an ugly mess of removing the escape characters.
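For reference, the escaped variant can be handled in two steps: split on semicolons that are not preceded by a backslash, then unescape each token. This is only a minimal sketch; the class name EscapedSplit and the method tokenize are hypothetical, and it assumes a backslash is never itself escaped in the input.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplit {
    // Hypothetical helper: assumes "\;" is the only escape sequence in the input.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Split only on semicolons that are not immediately preceded by a backslash.
        for (String raw : input.split("(?<!\\\\);")) {
            // Turn each escaped "\;" back into a plain ";".
            tokens.add(raw.replace("\\;", ";"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";
        for (String token : tokenize(toTokenize)) {
            System.out.println(token);
        }
    }
}
```

The lookbehind keeps the split and the unescaping as two separate, small steps, but it breaks down as soon as a value has to end in a literal backslash.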
Do you have any suggestions?
Upvotes: 1
Views: 98
Reputation: 9766
Prerequisites:
All your tokens start with prop.
There is no prop in the string other than at the beginning of a token.
I'd just replace every ;prop with ~prop (this also assumes that ~ does not occur anywhere else in the string).
Then your string becomes:
"prop1=value1~prop2=String test='1234';int i=4;~prop3=value3"
You can then tokenize using the ~ delimiter.
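The replace-then-split idea can be sketched as follows. The class name TildeSplit and the method tokenize are made up for illustration, and the sketch relies on the same assumptions as above: prop only appears at the start of a token, and ~ never appears in the input.

```java
import java.util.Arrays;
import java.util.List;

public class TildeSplit {
    // Hypothetical helper: swaps every ";prop" for "~prop", then splits on "~".
    // Assumes "~" never occurs in the input and "prop" only starts a token.
    static List<String> tokenize(String input) {
        String marked = input.replace(";prop", "~prop");
        return Arrays.asList(marked.split("~"));
    }

    public static void main(String[] args) {
        String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";
        for (String token : tokenize(toTokenize)) {
            System.out.println(token);
        }
    }
}
```

If the placeholder character is a concern, splitting directly with a lookahead, input.split(";(?=prop)"), yields the same tokens without rewriting the string first.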
Upvotes: 0
Reputation: 627488
You may match either a <![CDATA[...]]> section or any char other than ; one or more times to match the values. To match the keys, you may use a regular \w+ pattern:
(\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)
See the regex demo.
Details
(\w+) - Group 1: one or more word chars
= - a = sign
((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more sequences of
    <!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring
    | - or
    [^;] - any char but ;
See a Java demo:
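import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the key; group 2 is the value, which may contain CDATA sections with semicolons.
String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " => " + matcher.group(2));
}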
Results:
prop1 => value1
prop2 => <![CDATA[String test='1234';int i=4;]]>
prop3 => value3
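If the values are needed without the CDATA wrapper, the same pattern can feed a map. The following is only a sketch: the class name CdataProps, the method parse, and the second pattern used to strip the wrapper are additions for illustration, not part of the demo above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CdataProps {
    // Same key/value pattern as in the demo above.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)");
    // Hypothetical extra pattern: captures the text inside one CDATA section.
    private static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[(.*?)]]>");

    static Map<String, String> parse(String input) {
        // LinkedHashMap keeps the keys in the order they appear in the input.
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Replace each CDATA section in the value with its inner text.
            String value = CDATA.matcher(m.group(2)).replaceAll("$1");
            props.put(m.group(1), value);
        }
        return props;
    }

    public static void main(String[] args) {
        String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
        parse(s).forEach((key, value) -> System.out.println(key + " => " + value));
    }
}
```

Note that .*? does not cross line breaks by default, so a multi-line CDATA section would need Pattern.DOTALL on both patterns.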
Upvotes: 1