SOUser

Reputation: 3852

How to use positive regex lookahead to match, but exclude the lookahead part?

The lines to match against are

part1a_part1b__part1c_part1d_part3.extension
part1a_part1b__part1c_part1d__part3.extension
part1a_part1b__part1c_part1d_part2short_part3.extension
part1a_part1b__part1c_part1d_part2short__part3.extension
part1a_part1b__part1c_part1d_part2_part3.extension
part1a_part1b__part1c_part1d_part2__part3.extension
part1a_part1b__part1c_part1d_part2full_part3.extension
part1a_part1b__part1c_part1d_part2full__part3.extension
part1a_part1b__part1c_part1d_part2short-part3.extension
part1a_part1b__part1c_part1d_part2-part3.extension
part1a_part1b__part1c_part1d_part2full-part3.extension
part1a_part1b__part1c_part1d_part4.extension
part1a_part1b__part1c_part1d__part4.extension

The desired match should be exactly part1a_part1b__part1c_part1d for every line above except the last two. That is to say, each line has a "stem" made of an arbitrary number of part1 pieces, followed by an optional part2 (in a limited set of forms), and must end with part3.extension.

So far, I have only got as far as

(?P<stem>[[:alnum:]_-]+)(?=(|part2short|part2|part2full))[_-]+part3\.extension

which gives the following "stem" values for the lines above:

part1a_part1b__part1c_part1d
part1a_part1b__part1c_part1d_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2short_
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2_
part1a_part1b__part1c_part1d_part2full
part1a_part1b__part1c_part1d_part2full_
part1a_part1b__part1c_part1d_part2short
part1a_part1b__part1c_part1d_part2
part1a_part1b__part1c_part1d_part2full

Could you explain how to match exactly part1a_part1b__part1c_part1d on every line above except the last two, if that is possible?
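
For anyone who wants to reproduce the stem values listed above, here is a minimal Python sketch of the attempt. It rests on one assumption: Python's `re` module has no POSIX classes, so `[[:alnum:]]` is spelled out as `A-Za-z0-9`. The helper name `attempt_stem` is made up for illustration.

```python
import re

# Assumption: [[:alnum:]] is written as A-Za-z0-9 because Python's `re`
# does not support POSIX character classes.
ATTEMPT = re.compile(
    r"(?P<stem>[A-Za-z0-9_-]+)"            # greedy stem
    r"(?=(|part2short|part2|part2full))"   # empty alternative: always succeeds
    r"[_-]+part3\.extension"
)


def attempt_stem(line):
    """Return the stem captured by the attempt, or None if nothing matches."""
    m = ATTEMPT.search(line)
    return m.group("stem") if m else None


for line in [
    "part1a_part1b__part1c_part1d_part3.extension",
    "part1a_part1b__part1c_part1d_part2short__part3.extension",
    "part1a_part1b__part1c_part1d_part2full-part3.extension",
    "part1a_part1b__part1c_part1d_part4.extension",
]:
    print(line, "->", attempt_stem(line))
```

The greedy stem gives back only as many characters as the trailing `[_-]+part3\.extension` needs, and the lookahead never rejects anything because its first alternative is empty, so part2 ends up inside the stem.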

Upvotes: 0

Views: 881

Answers (2)

anubhava

Reputation: 784938

You may use this regex, which combines a non-greedy match with a lookahead containing an optional group:

(?m)^(?P<stem>[[:alnum:]_-]+?)(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)


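(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$) is a positive lookahead that asserts that the line ends with [_-]+part3.extension, optionally preceded by [_-]+part2, [_-]+part2short or [_-]+part2full. Because the stem is matched non-greedily, it stops at the first position where this lookahead succeeds.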
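
As a quick check of the pattern above, here is a minimal Python sketch. It assumes the regex is run with Python's `re` module, which has no POSIX classes, so `[[:alnum:]]` is replaced by `A-Za-z0-9`. The helper name `stem_of` is hypothetical.

```python
import re

# Assumption: [[:alnum:]] is written as A-Za-z0-9 for Python's `re`.
STEM = re.compile(
    r"^(?P<stem>[A-Za-z0-9_-]+?)"  # lazy: grow the stem one character at a time
    r"(?=(?:[_-]+part2(?:short|full)?)?[_-]+part3\.extension$)",
    re.MULTILINE,                  # same effect as the inline (?m) flag
)


def stem_of(line):
    """Return the captured stem, or None when the line does not qualify."""
    m = STEM.search(line)
    return m.group("stem") if m else None


SAMPLES = [
    "part1a_part1b__part1c_part1d_part3.extension",
    "part1a_part1b__part1c_part1d_part2short__part3.extension",
    "part1a_part1b__part1c_part1d_part2-part3.extension",
    "part1a_part1b__part1c_part1d_part4.extension",
]

for line in SAMPLES:
    print(line, "->", stem_of(line))
```

Since the lookahead consumes nothing, `m.group(0)` and `m.group("stem")` are the same string here, which is what the question asks for.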

Upvotes: 1

The fourth bird

Reputation: 163207

You could match the first four parts explicitly, including the underscores between them, and use a positive lookahead that asserts that the string ends with part3.extension:

^(?P<stem>[^_]+_[^_]+__[^_]+_[^_]+)(?=.*part3\.extension$)

The pattern breaks down as follows:

^                     # Beginning of the string
(?P<stem>             # Named capturing group stem
[^_]+_                # Match any char except _ one or more times, then _
[^_]+__               # Match any char except _ one or more times, then __
[^_]+_                # Match any char except _ one or more times, then _
[^_]+                 # Match any char except _ one or more times
)                     # Close named capturing group
(?=                   # A positive lookahead that asserts what follows
  .*part3\.extension$ # Match part3.extension at the end of the string
)                     # Close lookahead
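
The commented form above maps naturally onto Python's `re.VERBOSE` mode. The following is only a sketch under that assumption, and the names `FIXED_LAYOUT` and `fixed_stem` are made up for illustration.

```python
import re

FIXED_LAYOUT = re.compile(
    r"""
    ^                       # start of the string
    (?P<stem>               # named group "stem"
      [^_]+_                #   first part, then _
      [^_]+__               #   second part, then __
      [^_]+_                #   third part, then _
      [^_]+                 #   fourth part
    )
    (?=                     # lookahead: checks without consuming
      .*part3\.extension$   #   the string must end with part3.extension
    )
    """,
    re.VERBOSE,
)


def fixed_stem(line):
    """Return the four-part stem, or None when the line does not qualify."""
    m = FIXED_LAYOUT.search(line)
    return m.group("stem") if m else None


print(fixed_stem("part1a_part1b__part1c_part1d_part2full-part3.extension"))
print(fixed_stem("part1a_part1b__part1c_part1d__part4.extension"))
```

Two caveats apply to this approach: the stem is fixed at exactly four parts with this separator layout, and `.*` in the lookahead accepts any text before part3.extension, not only the listed part2 forms.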

Upvotes: 1
