Reputation: 97
I have these kinds of strings in my document
<year of publication: 2007>
<कुम्भ-कारस्य>U
My aim is to bookmark the first kind of lines from my document.
I use the following regex: (?<=\<)([a-z]+?)(?<=\>)
However, this also selects the U in the second string. What should I change? I am using Notepad++.
Upvotes: 1
Views: 70
Reputation: 110163
Using a very strict understanding of your question (a <, followed by one or more non-brackets up until a year, followed by a >), I would suggest something like:
For something more complex or a better understanding, check out Wiktor's answer.
Upvotes: 0
Reputation: 626804
The problem you have is due to the fact that you are using \< and \>, leading and trailing word boundaries, rather than < and >, angle bracket patterns. The U letter is matched because \< and \> are word boundaries and U is preceded with > and not followed with a word char, i.e. it is enclosed with word boundary positions.
If you want to match lines with a pattern like <[LOWERCASE_WORD](optional lowercase words)?: [DIGITS]> you can use
(?-i)<[a-z]+(?:\h+[a-z]+)*:\h*\d+>
See the regex demo.
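For readers who want to reproduce the diagnosis outside Notepad++, here is a minimal Python sketch. It rests on assumptions: Python's re module has no \<, \> or \h, so they are emulated with \b(?=\w), (?<=\w)\b and [ \t], and the Match case option being off is emulated with re.IGNORECASE. The names WORD_START, WORD_END, original and strict are made up for this illustration.

```python
import re

# Assumption: emulate Boost-style \< (start of word) and \> (end of word),
# which Python's re does not provide.
WORD_START = r"\b(?=\w)"
WORD_END = r"(?<=\w)\b"

# Rough equivalent of the question's pattern with "Match case" off:
# it never looks for literal angle brackets, only word boundaries.
original = re.compile(WORD_START + r"([a-z]+?)" + WORD_END, re.IGNORECASE)

# Rough equivalent of (?-i)<[a-z]+(?:\h+[a-z]+)*:\h*\d+>
# ([ \t] stands in for \h; Python is case sensitive by default).
strict = re.compile(r"<[a-z]+(?:[ \t]+[a-z]+)*:[ \t]*\d+>")

lines = ["<year of publication: 2007>", "<कुम्भ-कारस्य>U"]

for line in lines:
    # First column: what the word-boundary pattern selects.
    # Second column: whether the strict pattern would bookmark the line.
    print(original.findall(line), bool(strict.search(line)))
```

The word-boundary version picks up every ASCII word, including the trailing U, while the strict version only accepts the first line.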
Details
(?-i) - the pattern is now case sensitive
< - a < char
[a-z]+ - one or more lowercase ASCII letters
(?:\h+[a-z]+)* - zero or more repetitions of one or more horizontal whitespace chars and one or more lowercase ASCII letters
: - a colon
\h* - zero or more horizontal whitespace chars
\d+ - one or more digits
> - a > char.
Bonus
These are some variations for you in case your pattern does not have to be that strict:
<[a-z\h]*:[^<>]*> (demo) - matches <, then zero or more spaces and letters (respecting your Match Case setting), :, then any zero or more chars other than < and > and then a >
<[a-z][^>]*> (demo) - matches <, an ASCII letter (respecting your Match Case setting), then any zero or more chars other than > and then a >
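To see how the two looser variations behave with and without case sensitivity, here is a small Python sketch. It assumes [ \t] as a stand-in for \h, and models the Match Case setting with a hypothetical match_case parameter. The function name build_variants and the extra sample strings are invented for this illustration.

```python
import re

def build_variants(match_case: bool = True):
    """Return the two looser patterns, compiled per a Match Case-like flag."""
    flags = 0 if match_case else re.IGNORECASE
    # <[a-z\h]*:[^<>]*> with [ \t] standing in for \h (assumption)
    loose_colon = re.compile(r"<[a-z \t]*:[^<>]*>", flags)
    # <[a-z][^>]*>
    loose_tag = re.compile(r"<[a-z][^>]*>", flags)
    return loose_colon, loose_tag

samples = [
    "<year of publication: 2007>",  # from the question
    "<कुम्भ-कारस्य>U",                # from the question
    "<Year: 2007>",                 # hypothetical, capitalised
    "<note>",                       # hypothetical, no colon
]

for match_case in (True, False):
    loose_colon, loose_tag = build_variants(match_case)
    for s in samples:
        print(match_case, s,
              bool(loose_colon.search(s)), bool(loose_tag.search(s)))
```

Neither variation matches the Devanagari line, since its first character after < is not an ASCII letter and it contains no colon.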
Upvotes: 0
Reputation: 521194
Trivially, you need to use a positive lookahead, not lookbehind, on the right side of your regex pattern:
(?<=<)([a-z]+?)(?=>)
               ^^^^^ change is here
Note that if you more generally just want to match tag contents, you could use:
<(.*?)>
And then access the first capture group.
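As a quick check, both patterns can be tried in Python, whose re module happens to accept this syntax unchanged. This is only a sketch: the variable names and the "<abc>" sample are made up, and Notepad++ itself uses a different regex engine.

```python
import re

# Lookbehind for "<" on the left, lookahead for ">" on the right.
lookaround = re.compile(r"(?<=<)([a-z]+?)(?=>)")

# More general: capture whatever sits between "<" and the next ">".
tag_contents = re.compile(r"<(.*?)>")

samples = ["<abc>", "<year of publication: 2007>", "<कुम्भ-कारस्य>U"]

for s in samples:
    match = tag_contents.search(s)
    # group(1) is the first capture group, i.e. the tag contents.
    print(lookaround.findall(s), match.group(1) if match else None)
```

Note that the lookaround pattern only matches a single run of lowercase letters between the brackets, so it skips the year line (spaces, colon, digits) as well as the Devanagari one, while the trailing U is no longer selected. The general pattern captures the contents of both bracketed strings.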
Upvotes: 1