user2602807

Reputation: 1292

How to write a regex with 'zero or one' groups that contain '.*'

I'm trying to get Record1, Record2 and Record3 from this text:

"Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123"

Each record appears ONE or ZERO times. I use the pattern:

(Record1.*)?(Record2.*)?(Record3.*)?

If every record appears, I get:

matcher.group(1) == "Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123"
matcher.group(2) == null
matcher.group(3) == null

If I use the pattern:

(Record1.*)(Record2.*)(Record3.*)

matcher.group(1) == "Record1 ANY TEXT 123 4 5 "
matcher.group(2) == "Record2 ANOTHER TEXT 90-8098 "
matcher.group(3) == "Record3 MORE TEXT ASD 123"

That's exactly what I want, but each record can appear zero times, so this regexp is not suitable.

What pattern should I use?

Upvotes: 7

Views: 8316

Answers (2)

user557597

Reputation:

If your text is tightly packed and is composed of just Record, why not use split
(if Java calls it split).

split regex:

 #  "(?:(?!Record)[\\S\\s])*(Record[\\S\\s]*?)(?=Record|$(?!\\n))"


 (?:
      (?! Record )
      [\S\s] 
 )*
 ( Record [\S\s]*? )
 (?=
      Record
   |  $ (?! \n )
 )
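
For what it's worth, here is a minimal Java sketch of one way to apply this regex. It assumes the pattern is meant to be run repeatedly with Matcher.find(), collecting group 1 each time, rather than passed to String.split; the class and method names (RecordFinder, findRecords) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordFinder {
    static final String TEXT =
        "Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123";

    // The regex from the answer above, copied unchanged.
    static final Pattern RECORD = Pattern.compile(
        "(?:(?!Record)[\\S\\s])*(Record[\\S\\s]*?)(?=Record|$(?!\\n))");

    // Collects group 1 of every successive match.
    static List<String> findRecords(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(m.group(1));
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : findRecords(TEXT)) {
            // Brackets make the trailing space of each record visible.
            System.out.println("[" + record + "]");
        }
    }
}
```

Each record keeps its trailing space, so call trim() on the results if that matters.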

Upvotes: 0

Andrew Cheong

Reputation: 30273

You want to make your quantifiers non-greedy, and you want to use anchors:

^.*?(Record1.*?)?(Record2.*?)?(Record3.*?)?$

In your original expression, the first .* consumed everything to the end of the string, because quantifiers match as much as they can by default (called greedy matching). Since the second and third groups were optional, the engine had no reason to backtrack: matching everything with that first .* already gave a successful overall match.
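
To see the greedy behavior concretely, here is a small Java sketch that runs the original pattern from the question against the sample text. The class and method names (GreedyDemo, greedyGroups) are just for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    static final String TEXT =
        "Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123";

    // Returns groups 1..3 of the first match of the question's original pattern.
    static String[] greedyGroups(String input) {
        Matcher m = Pattern.compile("(Record1.*)?(Record2.*)?(Record3.*)?").matcher(input);
        if (!m.find()) {
            return new String[3]; // all null; not expected, since the pattern can match empty
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] groups = greedyGroups(TEXT);
        for (int i = 0; i < groups.length; i++) {
            // Group 1 holds the whole text; groups 2 and 3 are null.
            System.out.println("group " + (i + 1) + " = " + groups[i]);
        }
    }
}
```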

By adding a ? after any quantifier, e.g. *? or +? or ?? or {m,n}?, you instruct the engine to match as little as possible, i.e. invoke non-greedy matching.
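
As a quick illustration of the difference, the sketch below compares a greedy and a non-greedy quantifier on a toy input. The input string and the helper name firstMatch are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "aXbYb";
        System.out.println(firstMatch("a.*b", input));  // greedy: runs to the last 'b', prints aXbYb
        System.out.println(firstMatch("a.*?b", input)); // non-greedy: stops at the first 'b', prints aXb
    }
}
```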

So, why the anchors? Well, if you invoke non-greedy matching, the engine's going to try to match as little as possible. Without anchors, every .*? would stop immediately, so group 1 would capture just "Record1" and the other groups would capture nothing. By forcing the whole expression to match the beginning, ^, as well as the end, $, you force the regular expression to find some way to match as few characters as possible via .*?, but still match as much as needed to get all the details.
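
Here is a minimal Java sketch of the anchored pattern in use, including a case where Record2 is absent. The names AnchoredLazyDemo and records are hypothetical, and the second input string is made up for the example. Note that . does not match line breaks by default, so this assumes single-line input (or Pattern.DOTALL otherwise).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredLazyDemo {
    static final Pattern RECORDS =
        Pattern.compile("^.*?(Record1.*?)?(Record2.*?)?(Record3.*?)?$");

    // Returns groups 1..3; a group is null when its record is absent.
    static String[] records(String input) {
        Matcher m = RECORDS.matcher(input);
        if (!m.find()) {
            return new String[3]; // all null; happens only if the input contains a line break
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String all =
            "Record1 ANY TEXT 123 4 5 Record2 ANOTHER TEXT 90-8098 Record3 MORE TEXT ASD 123";
        String noSecond = "Record1 A Record3 B";
        for (String input : new String[] { all, noSecond }) {
            String[] groups = records(input);
            for (int i = 0; i < groups.length; i++) {
                System.out.println("group " + (i + 1) + " = " + groups[i]);
            }
        }
    }
}
```

The captured groups keep their trailing space (for example "Record1 ANY TEXT 123 4 5 "), so trim() them if needed.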

Upvotes: 8
