Devabc

Reputation: 5271

Regex: How to capture this? (a nested group inside a repeated group)

How can I solve this Java regex problem?

Input:

some heading text... ["fds afsa","fwr23423","42df f","1a_4(211@#","3240acg!g"] some trailing text....

Problem: I would like to capture everything between the double quotes. (Example: fds afsa, fwr23423, etc.)

I have tried the following pattern:

\[(?:"([^"]+)",?)+\]

But Matcher.find() throws a StackOverflowError on a larger input (it does work for a small input; this is a bug in Java). And even if it did work, matcher.group(1) would only return the last captured value, 3240acg!g.
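A minimal sketch reproducing the second symptom on a small input. The class and method names (LastCaptureDemo, lastCapture) are made up for illustration; only the pattern comes from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // The pattern from the question: a capturing group nested in a repeated group.
    private static final Pattern LIST =
            Pattern.compile("\\[(?:\"([^\"]+)\",?)+\\]");

    // Returns what group(1) holds after a successful find(), or null if nothing matched.
    // A capturing group inside a repetition keeps only the text of its last iteration.
    static String lastCapture(String input) {
        Matcher m = LIST.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "some heading text... [\"fds afsa\",\"fwr23423\",\"3240acg!g\"] some trailing text....";
        System.out.println(lastCapture(input));
    }
}
```

Only the final quoted value survives in group 1; the earlier iterations are overwritten.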

How can I solve this issue? (Or is the use of multiple patterns required, where the first pattern strips the brackets?)

Upvotes: 6

Views: 2284

Answers (2)

Tim Pietzcker

Reputation: 336108

Three suggestions:

If strings can only occur between brackets, then you don't need to check for the brackets at all: just use "[^"]*" as your regex and find all matches (assuming no escaped quotes).
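A rough sketch of this first suggestion, assuming quoted strings appear nowhere outside the brackets and contain no escaped quotes. The class name QuotedStrings and the added capturing group (so the surrounding quotes are dropped) are illustrative choices, not part of the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStrings {
    // One double-quoted string; group 1 is the text between the quotes.
    // Assumption: no escaped quotes inside the strings.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Collects the contents of every quoted string found anywhere in the input.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "some heading text... [\"fds afsa\",\"fwr23423\",\"42df f\"] some trailing text....";
        System.out.println(extract(input));
    }
}
```

Because each find() call matches a single short string, there is no repeated group and therefore no deep recursion on long inputs.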

If that doesn't work because strings could occur in other places too, where you don't want to capture them, do it in two steps.

  1. Match \[[^\]]*\].
  2. Find all occurrences of "[^"]*" within the result of the first match. Or even use a JSON parser to read that string.
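The two steps could be sketched roughly as follows. This assumes only the first bracketed list is of interest and that no strings contain ] or escaped quotes; the class name BracketedStrings is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketedStrings {
    // Step 1: the bracketed list, from '[' up to the first ']'.
    private static final Pattern LIST = Pattern.compile("\\[[^\\]]*\\]");
    // Step 2: one quoted string; group 1 is the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the quoted strings inside the first bracketed list, ignoring
    // quoted strings elsewhere in the input.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher list = LIST.matcher(input);
        if (list.find()) {
            Matcher quoted = QUOTED.matcher(list.group());
            while (quoted.find()) {
                result.add(quoted.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\"ignored\" [\"fds afsa\",\"fwr23423\"] \"also ignored\"";
        System.out.println(extract(input));
    }
}
```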

Third possibility, cheating a bit:

Search for "[^"\[\]]*"(?=[^\[\]]*\]). That will match a string only if the next bracket that follows is a closing bracket. Limitation: No brackets are allowed inside the strings. I consider this ugly, especially if you look at what it looks like in Java:

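import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match a quoted string only if the next bracket after it is ']'; subjectString is the input.
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\"[^\"\\[\\]]*\"(?=[^\\[\\]]*\\])");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}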

Do you think anybody who looks at this in a few months can tell what it's doing?

Upvotes: 1

Zernike

Reputation: 1766

Get the string between [ and ], then split it on commas. It's much easier.
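A minimal sketch of that approach without any regex matching on the whole input. It assumes the strings themselves contain no commas or brackets; the class name SplitStrings and the quote-stripping step are illustrative additions.

```java
import java.util.ArrayList;
import java.util.List;

class SplitStrings {
    // Takes the text between the first '[' and the next ']', splits it on
    // commas, and strips the surrounding double quotes from each item.
    // Assumption: no commas or brackets occur inside the strings.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        int open = input.indexOf('[');
        if (open < 0) {
            return result;
        }
        int close = input.indexOf(']', open + 1);
        if (close < 0) {
            return result;
        }
        String inner = input.substring(open + 1, close);
        if (inner.isEmpty()) {
            return result;
        }
        for (String item : inner.split(",")) {
            String t = item.trim();
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                t = t.substring(1, t.length() - 1);
            }
            result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "some heading text... [\"fds afsa\",\"fwr23423\",\"42df f\"] some trailing text....";
        System.out.println(extract(input));
    }
}
```

If a string could contain a comma, the split would cut it in two, so this only suits inputs where that cannot happen.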

Upvotes: 1
