Reputation: 808
I'd like to think I'm pretty good at RegEx, but this one has me stumped. The search string looks like this:
ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~ISA*lots**of~other~data**with~~no terminating **pattern~
No line breaks.
ISA* is a consistent starting pattern.
The rest of the string is completely unpredictable.
I need ISA* and all characters until the next instance of that pattern.
I tried a positive look-ahead, but this doesn't capture the last result:
(ISA\*(.*(?=ISA\*))?)
I also tried a positive look-behind, but I can't figure out how to make it lazy. If it's not lazy, there is only one match. If it is lazy, I get the right number of matches, but only one additional character after the pattern:
ISA\*(?<=ISA\*).*?
The other solution is to programmatically split (or explode) the string, remove the first (empty) result, and then re-attach the delimiter to each result. Indeed, that is what I already have in place. But the size of the file, the large number of results, and the post-processing are causing performance issues. In a preliminary test, using regex appears to offer some worthwhile performance gains.
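For reference, the split-and-reattach approach described above can be sketched roughly as follows. The sample string is a shortened, made-up stand-in for the real EDI data, and the variable names are only illustrative.

```php
<?php
// Hypothetical, shortened sample; the real file is far larger.
$str = 'ISA*a~b**c~ISA*d~~e~ISA*f~';

// Split on the delimiter. The first element is empty because
// the string begins with "ISA*".
$parts = explode('ISA*', $str);
array_shift($parts);

// Re-attach the delimiter to each remaining piece.
$records = array_map(function ($part) {
    return 'ISA*' . $part;
}, $parts);

print_r($records);
```

The extra pass that re-attaches the delimiter is the post-processing step mentioned above.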
This is being processed with PHP. The string is sourced from an AS400 system, in an "EDI Transaction" text file. I have yet to find any libraries that contain a working regex for this type of file.
Upvotes: 1
Views: 872
Reputation: 18490
How about using preg_split?
$res = preg_split('/\b(?!^)(?=ISA\*)/', $str);
\b(?!^) - split at any word boundary, but not at the start of the string
(?=ISA\*) - if followed by the specified substring
See the PHP demo at eval.in or the regex demo at regex101.
If ~ before ISA is predictable, use (?<=~) instead of \b(?!^).
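A small sketch of both split patterns, run against a shortened, made-up sample string (the real data is assumed to follow the same ISA*...~ISA*... shape):

```php
<?php
// Hypothetical, shortened sample of the input.
$str = 'ISA*a~b**c~ISA*d~~e~ISA*f~';

// Split at a word boundary that is not the start of the string
// and is immediately followed by "ISA*".
$res = preg_split('/\b(?!^)(?=ISA\*)/', $str);

// Variant assuming every "ISA*" after the first is preceded by "~".
$resTilde = preg_split('/(?<=~)(?=ISA\*)/', $str);

print_r($res);
print_r($resTilde);
```

Because both patterns are zero-width, nothing is consumed by the split, so each piece keeps its leading ISA* and no delimiter has to be re-attached afterwards.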
Upvotes: 1
Reputation: 26917
You could also try expressing what to capture instead, using preg_match_all:
ISA\*(?:[^I]|I[^S]|IS[^A]|ISA[^*])+
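As a rough illustration, this is how that pattern could be applied with preg_match_all. The sample string is a shortened, made-up stand-in for the real data.

```php
<?php
// Hypothetical, shortened sample of the input.
$str = 'ISA*a~b**c~ISA*d~~e~ISA*f~';

// Match "ISA*" followed by one or more chunks that cannot
// be the start of another "ISA*".
$pattern = '/ISA\*(?:[^I]|I[^S]|IS[^A]|ISA[^*])+/';
preg_match_all($pattern, $str, $m);

// $m[0] holds the full matches, one per record.
print_r($m[0]);
```

One caveat worth checking against real data: the I[^S] branch consumes two characters, so an I sitting directly before a delimiter (as in aIISA*b) is swallowed together with the delimiter's own I, and the two records come back as a single match. The + quantifier also means a record with nothing after ISA* is skipped.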
Upvotes: 0
Reputation: 626845
Use the following lookahead based regex:
ISA\*(.*?)(?=ISA\*|$)
See the regex demo
Details:
ISA\* - a literal ISA* substring
(.*?) - Group 1 capturing any 0+ chars other than line break chars, as few as possible (due to the lazy *? quantifier), up to (but excluding from the match)...
(?=ISA\*|$) - either ISA* or the end of the string (since it is a lookahead, the text matched with the pattern is not put into the returned match value).
Another variation of the same regex is:
ISA\*((?:(?!ISA\*).)*)
See the regex demo. Unrolled version (the most efficient):
ISA\*([^I]*(?:I(?!SA\*)[^I]*)*)
See this regex demo.
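To see the three variants side by side, here is a minimal sketch that runs each one through preg_match_all on a shortened, made-up sample string. The labels in the array are just illustrative names, not terms from the answer.

```php
<?php
// Hypothetical, shortened sample of the input.
$str = 'ISA*a~b**c~ISA*d~~e~ISA*f~';

// The three patterns from above, keyed by an illustrative label.
$patterns = [
    'lazy'      => '/ISA\*(.*?)(?=ISA\*|$)/',
    'lookahead' => '/ISA\*((?:(?!ISA\*).)*)/',
    'unrolled'  => '/ISA\*([^I]*(?:I(?!SA\*)[^I]*)*)/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $str, $m);
    // $m[0] holds the full matches, $m[1] the text after "ISA*".
    $results[$name] = ['full' => $m[0], 'group1' => $m[1]];
    echo $name, ': ', implode(' | ', $m[0]), PHP_EOL;
}
```

All three return the same records on this sample. Since the first two rely on the dot, they would need the s modifier if the real data ever contained line breaks; the unrolled version uses only negated character classes and is not affected by that.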
Upvotes: 0