Jonathan A. Reth

Reputation: 55

Regex won't return a single character when surrounded by punctuation

Before I begin let me say I am new to regex, but today I have done extensive research and cannot find a solution to the following problem.

EDIT: I want to return just the numbers in all examples. But I want the punctuation excluded.

A single character string will not be returned if you surround it with punctuation and then choose not to return the punctuation.

Here's a basic example of this problem.

[^<].*[^>] on <12> returns 12
[^<].*[^>] on <1> returns nothing

If the punctuation you are excluding is on only one side, it works fine.

[^<].* on <1 returns 1
.*[^>] on 1> returns 1
[^<].*[^>] on <1> returns nothing

Here are the regexes I have tried and their results.

[^<].*[^>] on <1> returns nothing
[^<][.]*[^>] on <1> returns nothing
[^<]+[^>] on <1> returns nothing
[^<][^\r\n]*[^>] on <1> returns nothing
[^<]\w*[^>] on <1> returns nothing
[^<]\d*[^>] on <1> returns nothing
[^<].?[^>] on <1> returns nothing
[^<][0-9]?[^>] on <1> returns nothing
[^<].*?[^>] on <1> returns nothing

Any help would be greatly appreciated.

Upvotes: 1

Views: 116

Answers (2)

Abdessabour Mtk

Reputation: 3888

Your regular expression works in some cases, but it is wrong. Let me first explain:

  • [^<] matches any single character that's not a less-than sign <. Inside a character class (that is, between brackets []), a leading ^ negates the class.
  • .* matches any character zero or more times.

Let's look at how your regexes work:
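  1. [^<].*[^>] with <12>:
    • [^<] can't match <, so the match starts at 1.
    • .* is greedy, so it first matches 2>.
    • [^>] then has nothing left to match, so the regular expression engine backtracks: .* gives up the >, which [^>] can't match, and then gives up the 2. Now .* matches nothing and [^>] matches 2, so the overall match is 12.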
  2. [^<].*[^>] with <1>:
    • [^<] can't match <, so it matches 1.
    • .* matches the >.
    • [^>] now has nothing left to match: it needs a character that's not >, and the engine has already reached the end of the string. The engine backtracks so that .* matches nothing, but the next character is >, which [^>] can't match. That's why the match fails.
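The walk-through above can be checked with a few lines of code. This is only a sketch, assuming Python's re module (the question doesn't say which language or tool is in use); first_match is a hypothetical helper name made up for the illustration.

```python
import re

# The pattern from the question: one non-"<" character, anything,
# then one non-">" character. It needs at least two characters
# that are not the excluded punctuation.
pattern = re.compile(r"[^<].*[^>]")

def first_match(text):
    """Return the matched substring, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match("<12>"))  # 12
print(first_match("<1>"))   # None
```

Both [^<] and [^>] must each consume one character, so a single character between the brackets can never satisfy the pattern.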

What you meant to do is ^<(.*?)>, where:

  • ^ matches the beginning of the string (you can omit this if you want to match anywhere in the string).
  • < matches a less-than sign.
  • .*? matches zero or more occurrences of any character, as few as possible (the ? makes the quantifier lazy). To be more specific and match only digits, use \d or [0-9] in place of the period.
  • > matches a greater-than sign.

The parentheses mean "capture these characters"; in regex jargon they are called a capture group.
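Here is a minimal sketch of the capture-group approach, again assuming Python's re module. The function names between_brackets and digits_between_brackets are hypothetical, chosen only for this example.

```python
import re

def between_brackets(text):
    """Return the text captured between a leading < and the next >, or None."""
    m = re.search(r"^<(.*?)>", text)
    # group(1) is the first capture group, i.e. the part inside the parentheses.
    return m.group(1) if m else None

def digits_between_brackets(text):
    """Stricter variant: only accept one or more digits between < and >."""
    m = re.search(r"^<(\d+)>", text)
    return m.group(1) if m else None

print(between_brackets("<1>"))          # 1
print(between_brackets("<12>"))         # 12
print(digits_between_brackets("<ab>"))  # None
```

The brackets are part of the match but not part of the captured group, which is what lets a single character come back on its own.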

Another way to go about this is to use lookaheads (?=...) and lookbehinds (?<=...). These are zero-width assertions: they check that the following (resp. preceding) characters match the given pattern without including them in the match.

The regex would become (?<=<).*(?=>), which matches whatever lies between < and >.
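A short sketch of the lookaround version, assuming Python's re module (lookbehind support varies between regex flavors, so check the one you are using). The helper name inner is hypothetical.

```python
import re

# (?<=<) asserts that a "<" precedes the match; (?=>) asserts that a ">" follows.
# Neither assertion consumes characters, so the brackets are not in the result.
lookaround = re.compile(r"(?<=<).*(?=>)")

def inner(text):
    """Return the text between < and >, or None if there is no match."""
    m = lookaround.search(text)
    return m.group(0) if m else None

print(inner("<1>"))   # 1
print(inner("<12>"))  # 12
```

Here the whole match is already the wanted text, so no capture group is needed.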

Upvotes: 2

Elton Clark

Reputation: 156

The [^<] (any character that is not a "<") matches the 1 of 12, then .* matches nothing and [^>] (any character that is not a ">") matches the 2.

If you are looking to extract the digits between the < and >, your regex would look like <(.*)>. That matches the whole set, brackets included, but the parentheses around the .* make the text between them available as a captured subgroup. Depending on the language you are using, you will need to use the library available to extract the subgroup match.
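As one illustration of extracting the subgroup, here is a sketch assuming Python's re module; other languages expose the same idea through their own match objects. The helper name whole_and_group is hypothetical.

```python
import re

def whole_and_group(text):
    """Return (whole match, captured subgroup) for <(.*)>, or None."""
    m = re.search(r"<(.*)>", text)
    if m is None:
        return None
    # group(0) is everything the regex matched, brackets included;
    # group(1) is only what the parentheses captured.
    return m.group(0), m.group(1)

print(whole_and_group("<1>"))   # ('<1>', '1')
print(whole_and_group("<12>"))  # ('<12>', '12')
```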

Upvotes: 0
