stratosgear

Reputation: 962

How do I capture the text before and after multiple regex matches in Java?

Given a test string of:

I have a 1234 and a 2345 and maybe a 3456 id.

I would like to match all the IDs (the four-digit numbers) and, at the same time, capture up to 12 characters of the surrounding text before and after each one (if any).

So the matches should be:

             BEFORE       MATCH      AFTER
Match #1:   I have a-      1234    -and a 2345-
Match #2:   -1234 and a-   2345    -and maybe a
Match #3:   and maybe a-   3456    -id.

A hyphen (-) above stands for a space character.

Note:

The BEFORE text of Match #1 is shorter than 12 characters because fewer than 12 characters precede it in the string. The same applies to the AFTER text of Match #3: fewer than 12 characters follow the last match.

Can I achieve these matches with a single regex in Java?

My best attempt so far uses a positive lookbehind and an atomic group to get the surrounding text, but it fails at the beginning and the end of the string when there are not enough characters (as in my note above):

(?<=(.{12}))(\d{4})(?>(.{12}))

This matches only 2345. If I use a small enough value for the quantifiers (2 instead of 12, for example) then I correctly match all IDs.

Here is a link to the regex playground where I have been testing my regexes:

http://regex101.com/r/cZ6wG4

Upvotes: 1

Views: 156

Answers (3)

Alan Moore

Reputation: 75222

You don't need a lookbehind or an atomic group for this, but you do need a lookahead:

(.{0,12}?)\b(\d+)\b(?=(.{0,12}))

I'm assuming your IDs are not enclosed in longer words (thus the \b). I used a reluctant quantifier in the leading portion ({0,12}?) to prevent it from consuming more than one ID when they're spaced close to each other, as in:

I have a 1234, 2345 and 1456 id.
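
For reference, here is a minimal sketch of how that pattern could be driven from Java. The class name ContextByRegex and the helper findWithContext are made up for illustration and are not part of the answer above; the only real change to the pattern is that every backslash is doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextByRegex {
    // Same pattern as above, with backslashes doubled for the Java string literal.
    static final Pattern ID_WITH_CONTEXT =
            Pattern.compile("(.{0,12}?)\\b(\\d+)\\b(?=(.{0,12}))");

    // Hypothetical helper: returns one {before, id, after} triple per match.
    static List<String[]> findWithContext(String input) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ID_WITH_CONTEXT.matcher(input);
        while (m.find()) {
            results.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return results;
    }

    public static void main(String[] args) {
        for (String[] r : findWithContext("I have a 1234, 2345 and 1456 id.")) {
            System.out.println("[" + r[0] + "] [" + r[1] + "] [" + r[2] + "]");
        }
    }
}
```

Because the leading group is part of the match itself, it is consumed and cannot reach back into the previous match. For the sample string, the second ID therefore gets only ", " as its leading text, while the lookahead still sees the full 12 characters that follow it.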

Upvotes: 0

Tim Pietzcker

Reputation: 336158

You can do it in a single regex:

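import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("(?<=^.{0,10000}?(.{0,12}))(\\d+)(?=(.{0,12}))");
Matcher regexMatcher = regex.matcher(subjectString); // subjectString is the input text
while (regexMatcher.find()) {
    String before = regexMatcher.group(1); // up to 12 characters preceding the number
    String match = regexMatcher.group(2);  // the number itself
    String after = regexMatcher.group(3);  // up to 12 characters following the number
}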

Explanation:

(?<=          # Assert that the following can be matched before current position
 ^.{0,10000}? # Match as few characters as possible from the start of the string
 (.{0,12})    # Match and capture up to 12 chars in group 1
)             # End of lookbehind
(\d+)         # Match and capture in group 2: Any number
(?=           # Assert that the following can be matched here:
 (.{0,12})    # Match and capture up to 12 chars in group 3
)             # End of lookahead

Upvotes: 2

x4rf41

Reputation: 5337

When you look at the MatchResult (http://docs.oracle.com/javase/7/docs/api/java/util/regex/MatchResult.html) interface implemented by the Matcher class (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html), you will find the methods start() and end(), which give you the index of the first character of the match and the index just past its last character within the input string. Once you have the indices, you can use some simple math and the substring method to extract the parts you want.

I hope this helps you, because I won't write the entire code for you.

It might be possible to do what you want purely with regex, but I think using the indices and substring is easier (and probably more reliable).
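
A minimal sketch of the index-and-substring approach might look like the following. The class name ContextBySubstring, the helper findWithContext, and the width parameter are invented for illustration, and the pattern \d{4} is an assumption taken from the question's "four digit numbers". Math.max and Math.min clamp the window so it never runs past either end of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextBySubstring {
    // Assumption: an ID is any run of exactly four digits, as in the question.
    static final Pattern ID = Pattern.compile("\\d{4}");

    // Hypothetical helper: returns one {before, id, after} triple per match,
    // with at most `width` characters of context on each side.
    static List<String[]> findWithContext(String input, int width) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            // start() is the index of the first matched character,
            // end() is the index just past the last matched character.
            int from = Math.max(0, m.start() - width);
            int to = Math.min(input.length(), m.end() + width);
            results.add(new String[] {
                input.substring(from, m.start()),
                m.group(),
                input.substring(m.end(), to)
            });
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "I have a 1234 and a 2345 and maybe a 3456 id.";
        for (String[] r : findWithContext(input, 12)) {
            System.out.println("[" + r[0] + "] [" + r[1] + "] [" + r[2] + "]");
        }
    }
}
```

Unlike a pattern that consumes the leading text, the context windows here may overlap neighbouring matches, which is what the table in the question asks for.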

Upvotes: 3
