
Reputation: 808

Regex from Beginning Pattern (Inclusive) to Next Beginning Pattern (Exclusive)

I'd like to think I'm pretty good at RegEx, but this one has me stumped. Search string looks like this...

ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~

What I've Tried

A positive look-ahead, but this doesn't capture the last result: (ISA\*(.*(?=ISA\*))?)

A positive look-behind, but I can't figure out how to make it lazy. If it's not lazy, there is only one match. If it is lazy, I get the right number of matches, but only one additional character after the pattern: ISA\*(?<=ISA\*).*?
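To make the two failure modes concrete, here is a minimal PHP sketch run against a made-up string of four identical records (the `$segment` and `$str` names are illustrative only, not real EDI data). It suggests that the greedy `.*` runs to the last `ISA*` before backtracking, while a lazy quantifier with nothing after it stops as early as it is allowed to.

```php
<?php
// Hypothetical sample: four identical records, shaped like the string above.
$segment = 'ISA*lots**of~other~data**with~~no terminating **pattern~';
$str = str_repeat($segment, 4);

// Attempt 1: greedy look-ahead. The first match swallows records 1-3,
// and the last record is reduced to a bare "ISA*".
preg_match_all('/(ISA\*(.*(?=ISA\*))?)/', $str, $greedy);

// Attempt 2: lazy quantifier at the very end of the pattern.
// Nothing forces .*? to expand, so each match is only the delimiter.
preg_match_all('/ISA\*(?<=ISA\*).*?/', $str, $lazy);

echo count($greedy[0]), "\n"; // number of matches from attempt 1
echo count($lazy[0]), "\n";   // number of matches from attempt 2
```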

The other solution is to programmatically split or explode the string, remove the first (empty) result, and then re-attach the delimiter to each result. Indeed, that is what I already have in place. But the size of the file, the large number of results, and the post-processing are causing performance issues. In a preliminary test, using regex appears to offer some worthy performance gains.
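For reference, the split-and-reattach approach described above might look roughly like the sketch below. This is an assumption about its shape, not the actual production code; `$segment`, `$str`, and `$records` are illustrative names.

```php
<?php
// Hypothetical sample: four identical records.
$segment = 'ISA*lots**of~other~data**with~~no terminating **pattern~';
$str = str_repeat($segment, 4);

// Split on the literal delimiter; the delimiter itself is discarded.
$parts = explode('ISA*', $str);

// The string starts with the delimiter, so the first element is empty.
array_shift($parts);

// Re-attach the delimiter to every remaining piece.
$records = array_map(function ($part) {
    return 'ISA*' . $part;
}, $parts);

echo count($records), "\n"; // number of records recovered
```

The extra pass over every element is the post-processing step that a single regex call would avoid.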


This is being processed with PHP. The string is sourced from an AS400 system, in an "EDI Transaction" text file. I have yet to find any libraries that contain a working regex for this type of file.

Upvotes: 1

Views: 872

Answers (3)

bobble bubble

Reputation: 18490

How about using preg_split?

$res = preg_split('/\b(?!^)(?=ISA\*)/', $str);

See php demo at eval.in or regex demo at regex101

If ~ before ISA is predictable, use (?<=~) instead of \b(?!^).
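A small sketch of both variants, assuming a made-up input of four identical records (`$segment`, `$str`, and the result variable names are illustrative). Because the split points are zero-width, nothing is consumed and each piece keeps its leading `ISA*`.

```php
<?php
// Hypothetical sample: four identical records.
$segment = 'ISA*lots**of~other~data**with~~no terminating **pattern~';
$str = str_repeat($segment, 4);

// Variant 1: split at any word boundary that is not the start of the
// string and is immediately followed by "ISA*".
$byBoundary = preg_split('/\b(?!^)(?=ISA\*)/', $str);

// Variant 2: assumes every record ends in "~", so split only where
// a "~" precedes the next "ISA*".
$byTilde = preg_split('/(?<=~)(?=ISA\*)/', $str);

echo count($byBoundary), "\n"; // pieces from variant 1
echo count($byTilde), "\n";    // pieces from variant 2
```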

Upvotes: 1

NetMage

Reputation: 26917

You could also try expressing what to capture instead:

ISA\*(?:[^I]|I[^S]|IS[^A]|ISA[^*])+

using preg_match_all
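As a rough illustration, assuming a made-up input of four identical records (`$segment` and `$str` are illustrative names), the call could look like this. Each full match in `$m[0]` is one record including its leading `ISA*`.

```php
<?php
// Hypothetical sample: four identical records.
$segment = 'ISA*lots**of~other~data**with~~no terminating **pattern~';
$str = str_repeat($segment, 4);

// Consume "ISA*", then keep consuming as long as the upcoming text
// is not the start of another "ISA*".
preg_match_all('/ISA\*(?:[^I]|I[^S]|IS[^A]|ISA[^*])+/', $str, $m);
$records = $m[0];

echo count($records), "\n"; // number of records matched
```

One caveat worth testing against real data: the alternatives consume two to four characters at a time, so a record whose last character before the next `ISA*` is itself an `I` (as in `...IISA*`) would have the boundary skipped.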

Upvotes: 0

Wiktor Stribiżew

Reputation: 626845

Use the following lookahead based regex:

ISA\*(.*?)(?=ISA\*|$)

See the regex demo

Details:

  • ISA\* - a literal ISA* substring
  • (.*?) - Group 1 capturing any 0+ chars other than line break chars as few as possible (due to the lazy *? quantifier) up to (but excluding from the match)...
  • (?=ISA\*|$) - either ISA* or end of string (since it is a lookahead, the text matched with the pattern is not put into the returned match value).
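A short sketch of how this might be called from PHP. The two-record input is made up, and the `s` modifier is my assumption: since `.` does not match line breaks by default, a file with newlines inside a record would likely need it.

```php
<?php
// Hypothetical input: two records, the first one ending in a newline.
$str = "ISA*a~b~\nISA*c~d~";

// With the s (DOTALL) modifier, "." also matches the newline.
preg_match_all('/ISA\*(.*?)(?=ISA\*|$)/s', $str, $withS);

// Without it, the first record cannot reach the next "ISA*" and is lost.
preg_match_all('/ISA\*(.*?)(?=ISA\*|$)/', $str, $withoutS);

echo count($withS[0]), "\n";    // matches with the s modifier
echo count($withoutS[0]), "\n"; // matches without it
```

Here `$withS[0]` holds the full records and `$withS[1]` holds Group 1, the text after each `ISA*`.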

Another variation of the same regex is

ISA\*((?:(?!ISA\*).)*)

See the regex demo. Unrolled version (the most efficient):

ISA\*([^I]*(?:I(?!SA\*)[^I]*)*)

See this regex demo.
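A quick way to check that the three patterns agree is to run them over the same input and compare the results. The sketch below uses a made-up two-record string that deliberately contains stray `I` characters; the `s` modifier on the first two patterns is my addition, and the array keys are just labels.

```php
<?php
// Hypothetical input: two records containing "I" that is not part of "ISA*".
$first  = 'ISA*00*ID~GS*IN~';
$second = 'ISA*01*IS~ISB*X~';
$str = $first . $second;

$patterns = [
    'lazy'     => '/ISA\*(.*?)(?=ISA\*|$)/s',
    'tempered' => '/ISA\*((?:(?!ISA\*).)*)/s',
    'unrolled' => '/ISA\*([^I]*(?:I(?!SA\*)[^I]*)*)/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $str, $m);
    $results[$name] = $m[0]; // full matches, delimiter included
}

// Report whether all three variants found the same records.
var_dump($results['lazy'] === $results['tempered']
    && $results['tempered'] === $results['unrolled']);
```

This only checks that the outputs match; it does not measure speed, so timing the variants on a real file would still be needed to confirm the efficiency claim for a given data set.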

Upvotes: 0
