Adideva98

Reputation: 97

Regex confusion on Notepad++

I have these kinds of strings in my document

<year of publication: 2007>
<कुम्भ-कारस्य>U

My aim is to bookmark lines of the first kind in my document. I use the following regex: (?<=\<)([a-z]+?)(?<=\>) However, this also ends up selecting the U in the second string. How can I fix this? This is in Notepad++.

Upvotes: 1

Views: 70

Answers (3)

David542

Reputation: 110163

Going by a very strict reading of your question (a < followed by one or more non-bracket characters, then a year, then a >), I would suggest something like:

[image: the suggested pattern, not reproduced in text]

For something more complex or a better understanding, check out Wiktor's answer.

Upvotes: 0

Wiktor Stribiżew

Reputation: 626804

The problem is that you are using \< and \>, which are leading and trailing word boundaries, rather than < and >, which match literal angle brackets. The letter U is matched because

  • You have the Match Case option unchecked, and
  • \< and \> are word boundaries, and U is preceded by > and not followed by a word char, i.e. it is enclosed by word boundary positions.
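
The effect described above can be reproduced outside Notepad++ with a rough sketch. Python's re module has no \< or \>, so this example approximates them with \b plus a lookaround; that emulation is an assumption and not the exact Boost behaviour Notepad++ uses.

```python
import re

# Rough Python emulation of the original pattern (?<=\<)([a-z]+?)(?<=\>).
# Assumption: \< (start of word) is approximated as \b(?=\w)
# and \> (end of word) as \b(?<=\w).
emulated = r"\b(?=\w)([a-z]+?)\b(?<=\w)"

line = "<कुम्भ-कारस्य>U"

# Case-insensitive search, like Notepad++ with Match Case unchecked.
insensitive_hits = re.findall(emulated, line, flags=re.IGNORECASE)

# Case-sensitive search, like Match Case checked (or a leading (?-i)).
sensitive_hits = re.findall(emulated, line)

print(insensitive_hits)
print(sensitive_hits)
```

In this sketch only the case-insensitive search picks up the trailing U, which is consistent with the two conditions listed above.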

If you want to match lines with a pattern like <[LOWERCASE_WORD](optional lowercase words)?: [DIGITS]> you can use

(?-i)<[a-z]+(?:\h+[a-z]+)*:\h*\d+>

See the regex demo.

Details

  • (?-i) - the pattern is now case sensitive
  • < - a < char
  • [a-z]+ - one or more lowercase ASCII letters
  • (?:\h+[a-z]+)* - zero or more repetitions of one or more horizontal whitespace chars followed by one or more lowercase ASCII letters
  • : - a colon
  • \h* - zero or more horizontal whitespace chars
  • \d+ - one or more digits
  • > - a > char.
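
As a quick check of the breakdown above, here is a minimal Python sketch run against the two sample strings from the question. Python's re does not support \h, so [ \t] stands in for it here (an assumption), and (?-i) is dropped because Python patterns are case sensitive by default.

```python
import re

# Strict pattern from the answer, with \h replaced by [ \t] (assumption:
# space and tab are the only horizontal whitespace that matters here).
strict = re.compile(r"<[a-z]+(?:[ \t]+[a-z]+)*:[ \t]*\d+>")

samples = ["<year of publication: 2007>", "<कुम्भ-कारस्य>U"]

# Keep only the lines the strict pattern finds a match in.
matched = [s for s in samples if strict.search(s)]
print(matched)
```

Only the first kind of line should survive the filter, which is the set of lines the question wants to bookmark.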

Bonus

These are some variations for you in case your pattern does not have to be that strict:

  • <[a-z\h]*:[^<>]*> (demo) - matches <, then zero or more spaces and letters (respecting your Match Case setting), :, then any zero or more chars other than < and > and then a >
  • <[a-z][^>]*> (demo) - matches <, an ASCII letter (respecting your Match Case setting), then any zero or more chars other than > and then a >.
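
The difference between the two looser variations can be seen in a small Python sketch. The third input, "<note>", is a hypothetical line that is not in the question, added only to show where the patterns diverge; \h is again replaced by [ \t] because Python's re lacks it.

```python
import re

# Variation 1: letters and spaces, a colon, then anything except < and >.
loose_colon = re.compile(r"<[a-z \t]*:[^<>]*>")
# Variation 2: an ASCII letter right after <, then anything except >.
loose_letter = re.compile(r"<[a-z][^>]*>")

samples = [
    "<year of publication: 2007>",
    "<कुम्भ-कारस्य>U",
    "<note>",  # hypothetical extra line, not from the question
]

colon_hits = [s for s in samples if loose_colon.search(s)]
letter_hits = [s for s in samples if loose_letter.search(s)]

print(colon_hits)
print(letter_hits)
```

Neither variation matches the Devanagari line, but only the second one accepts a tag without a colon.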

Upvotes: 0

Tim Biegeleisen

Reputation: 521194

Trivially, you need to use a positive lookahead, not lookbehind, on the right side of your regex pattern:

(?<=<)([a-z]+?)(?=>)
                ^^^ change is here

Note that if you more generally just want to match tag contents, you could use:

<(.*?)>

And then access the first capture group.
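
Both patterns from this answer happen to use syntax that Python's re also supports, so they can be tried in a short sketch. The "<title>" input is a hypothetical example, not from the question.

```python
import re

text = "<year of publication: 2007>\n<कुम्भ-कारस्य>U"

# General form: capture whatever sits between < and > (group 1).
tag_contents = re.findall(r"<(.*?)>", text)
print(tag_contents)

# Lookaround form: matches only when the brackets enclose
# nothing but lowercase ASCII letters.
lookaround = re.compile(r"(?<=<)([a-z]+?)(?=>)")

simple = lookaround.search("<title>")  # hypothetical input
sample = lookaround.search("<year of publication: 2007>")

print(simple.group(1))
print(sample)
```

Note that in this sketch the lookaround form finds nothing in the year-of-publication line, because [a-z] does not cover the spaces, colon, or digits, while the general form returns the contents of both tags.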

Upvotes: 1
