Reputation: 1605
I would like to split a string into quoted and unquoted parts, in which escaped quotes are ignored. For example, the following input:
String input = "Example with \"quoted \\\"test\\\" region\" embedded";
Should result in the following list:
String[] result = {"Example with", "\"quoted \\\"test\\\" region\"", "embedded"};
For splitting quoted regions (while ignoring escaped quotes) I use:
public static final String QUOTE_PATTERN = "(?<!\\\\)\".*?(?<!\\\\)\"";
String input = "Example with \"quoted \\\"test\\\" region\" embedded";
String[] result = input.split(QUOTE_PATTERN);
System.out.println(Arrays.toString(result));
Which provides the expected output [Example with , embedded]. However, I would very much like to have the delimiters (the quoted regions) in this list as well. (Of course, I can achieve this by getting the start and end indices using a Matcher, but that still requires a lot of extra code.)
I found a solution to split a string including the delimiters by using a lookahead and lookbehind which can successfully split a colon-separated string into a list that also contains the colons:
public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public static final String COLON_PATTERN = String.format(WITH_DELIMITER, ":");
String colonTest = "Part0:Part1:Part2";
String[] parts = colonTest.split(COLON_PATTERN);
System.out.println(Arrays.toString(parts));
This provides the following output: [Part0, :, Part1, :, Part2].
However, it seems that this cannot be applied to delimiters with a variable length, because:
public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public static final String QUOTE_PATTERN =
String.format(WITH_DELIMITER, "(?<!\\\\)\".*?(?<!\\\\)\"");
String input = "Example with \"quoted \\\"test\\\" region\" embedded";
String[] result = input.split(QUOTE_PATTERN);
System.out.println(Arrays.toString(result));
throws the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 23
((?<=(?<!\\)".*?(?<!\\)")|(?=(?<!\\)".*?(?<!\\)"))
                       ^
Does anyone know if something similar is possible for variable-width delimiters?
Thanks!
Upvotes: 1
Views: 729
Reputation: 626893
Since your strings are no longer than 200 characters, you can make use of Java's constrained-width look-behind: Java's look-behind accepts bounded quantifiers such as {0,200} (where both the minimum and maximum lengths are specified).
✽ Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
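As a minimal sketch of the rule quoted above (the class name BoundedLookbehindDemo, the method names, and the sample strings are made up for illustration), both look-behinds compile and can be used like any other assertion:

```java
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // Look-behind that can only match 3 or 4 characters ("cat" or "cats").
    private static final Pattern AFTER_CATS = Pattern.compile("(?<=cats?)!");

    // Look-behind with an explicit upper bound of 10 characters.
    private static final Pattern AFTER_AS = Pattern.compile("(?<=A{1,10})B");

    // Replaces every '!' that directly follows "cat" or "cats" with '?'.
    static String markCats(String text) {
        return AFTER_CATS.matcher(text).replaceAll("?");
    }

    // Lower-cases every 'B' that directly follows a run of 'A's.
    static String markBs(String text) {
        return AFTER_AS.matcher(text).replaceAll("b");
    }

    public static void main(String[] args) {
        System.out.println(markCats("cat! cats! dog!"));
        System.out.println(markBs("AAAB B"));
    }
}
```

Only the '!' after "cat" and "cats" and the 'B' directly after the run of 'A's are replaced; the other occurrences fail the look-behind and stay as they are.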
Thus, you can replace the unbounded .*? with the bounded .{0,200}:
String.format(WITH_DELIMITER, "(?<!\\\\)\".{0,200}(?<!\\\\)\"");
See IDEONE demo
String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
String QUOTE_PATTERN =
String.format(WITH_DELIMITER, "(?<!\\\\)\".{0,200}(?<!\\\\)\"");
String input = "Example with \"quoted \\\"test\\\" region\" embedded";
String[] result = input.split(QUOTE_PATTERN);
System.out.println(Arrays.toString(result));
Output:
[Example with , "quoted \"test\" region", embedded]
Upvotes: 1
Reputation: 38919
Lookaheads and lookbehinds are slower but tend to be more easily understandable. You can accomplish the same thing if you take the time to understand a simple repeated non-capturing group:
(?:[^\\"]|\\.)*
The first alternative matches any character that is not a backslash or a quotation mark. The second alternative matches any escaped character, i.e. a backslash followed by any character.
Combined with the asterisk, this matches everything up to the next unescaped quotation mark.
Now let's use that information in a regex:
((?:[^\\"]|\\.)*)("(?:[^\\"]|\\.)*")
This will first capture your preceding string and then capture your quoted string with the quotation marks.
If you just want to capture everything else up to the end of the line, you could add a (.*) to the end of the regex. But you could also expand that regex to deal with more than one quoted string on a line.
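One way to handle several quoted regions is to apply the regex repeatedly with a Matcher and collect both groups. The following is only a sketch under some assumptions: the class QuoteSplitter and the method split are hypothetical names, the regex is the one above written as a Java string literal, surrounding whitespace is kept rather than trimmed, and the quotes in the input are assumed to be balanced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteSplitter {
    // Group 1: unquoted text before a quoted region (may be empty).
    // Group 2: the quoted region, including its quotation marks.
    private static final Pattern PART = Pattern.compile(
            "((?:[^\\\\\"]|\\\\.)*)(\"(?:[^\\\\\"]|\\\\.)*\")");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = PART.matcher(input);
        int end = 0; // end of the last quoted region found so far
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                parts.add(m.group(1));
            }
            parts.add(m.group(2));
            end = m.end();
        }
        // Whatever follows the last quoted region is unquoted text.
        if (end < input.length()) {
            parts.add(input.substring(end));
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "Example with \"quoted \\\"test\\\" region\" embedded";
        System.out.println(split(input));
    }
}
```

For the input from the question this yields three parts: the text before the quoted region, the quoted region itself (escaped quotes included), and the trailing text. Call trim() on the unquoted parts if the surrounding spaces are not wanted.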
Upvotes: 0