Reputation: 318
I'm trying to capture a specific pattern from a large text document. The pattern is quite simple: if a line begins with a word and ends with the same word, I want to capture that line. For example:
phase1 begin trial end phase1
phase2.begin distribution end phase2
phase3 allow buying in phase3 but
phase4 has no end
phase5 is test of phase
In this document I would expect a match on line 1 and line 2, since both lines begin and end with the same word [a-zA-Z0-9]. Line 3 should not be matched because it does not end with the same word (although the word appears again inside the string), and lines 4 and 5 do not repeat the first word anywhere in the line. I tried using the pattern:
^([a-zA-Z0-9]*\b)(.+)(\b\1)$
It should have forced the string to end right after the backreference, but instead it matched all five lines (the groups do not match as intended, but there is a full match for each line). I think I am missing some fundamental understanding of regex, since I cannot see how to force it to match this specific pattern. It would be helpful if someone could explain the flaw in my thinking.
I have looked for this pattern, but mostly people try to match known words. The complication here is that I want to match any line as long as it starts with an arbitrary word and ends with that same word (as in the example, there might be any number of phases, or any other arbitrary word, in the document). I am using regex101 to test my pattern.
Upvotes: 0
Views: 226
Reputation: 163207
The reason it matches the whole string is that there is a word boundary between the start of the string and its first character. Because [a-zA-Z0-9]* may match zero characters, group 1 can be empty and \b still succeeds at that position.
What happens is that the regex backtracks until it can fit the backreference (an empty string) at the end of the string, and capture group 2 then contains the whole string, as you can see in the right panel with the matches.
The (.+) expects to match at least 1 character, and the \1 at the end refers to what is captured in group 1, which is an empty string.
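The empty capture can also be checked outside regex101. Below is a minimal sketch using Python's `re` module, which is an assumption here since the question only mentions regex101; the helper name `first_groups` is made up for illustration.

```python
import re

# The five sample lines from the question.
LINES = [
    "phase1 begin trial end phase1",
    "phase2.begin distribution end phase2",
    "phase3 allow buying in phase3 but",
    "phase4 has no end",
    "phase5 is test of phase",
]

# The pattern from the question: group 1 uses `*`, so it may be empty.
ORIGINAL = re.compile(r'^([a-zA-Z0-9]*\b)(.+)(\b\1)$')

def first_groups(pattern, lines):
    """Return group 1 for each matching line, or None when a line does not match."""
    result = []
    for line in lines:
        m = pattern.match(line)
        result.append(m.group(1) if m else None)
    return result

# Every line matches; for lines 3 to 5 group 1 is the empty string.
print(first_groups(ORIGINAL, LINES))
# ['phase1', 'phase2', '', '', '']
```

Lines 1 and 2 capture the real word on the first attempt, while the other three only succeed after group 1 has been backtracked down to nothing.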
To match only the first 2 strings, you can make the character class match one or more characters: [a-zA-Z0-9]+
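As a rough sketch of that fix applied to a whole document, again assuming Python's `re` module (the names `FIXED`, `TEXT` and `matching_lines` are hypothetical), the `re.MULTILINE` flag lets `^` and `$` anchor at each line:

```python
import re

# Same pattern as in the question, but group 1 now needs at least one character.
FIXED = re.compile(r'^([a-zA-Z0-9]+\b)(.+)(\b\1)$', re.MULTILINE)

TEXT = "\n".join([
    "phase1 begin trial end phase1",
    "phase2.begin distribution end phase2",
    "phase3 allow buying in phase3 but",
    "phase4 has no end",
    "phase5 is test of phase",
])

def matching_lines(text):
    """Return every line that starts and ends with the same word."""
    return [m.group(0) for m in FIXED.finditer(text)]

for line in matching_lines(TEXT):
    print(line)
# phase1 begin trial end phase1
# phase2.begin distribution end phase2
```

With `+`, group 1 can no longer shrink to an empty string, and the `\b` after it stops it from settling on a partial word such as `phase`, so the backreference has to repeat the full first word right before the end of the line.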
Upvotes: 1