aybe

Reputation: 16652

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)

((-|\+)?\w+(\^\.?\d+)?)

hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121

It works well except that :

The strings to capture are as follows :

A few examples of valid strings to capture :

EDIT

I have found the following expression, which effectively captures all these terms. It's not really optimized, but it works:

([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)

Upvotes: 2

Views: 120

Answers (2)

Jerry

Reputation: 71538

Ok, there are some issues to tackle here:

((-|+)?\w+(\^.?\d+)?)
    ^        ^

The + and . should be escaped like this:

((-|\+)?\w+(\^\.?\d+)?)

Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:

((-|\+)?[a-zA-Z]+(\^\.?\d+)?)

\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
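As a quick check of the \w versus [a-zA-Z] difference, here is a minimal JavaScript sketch run against the test string from the question. The variable names are only illustrative, and the outer capture group is dropped because only the full match is inspected:

```javascript
// Test string taken from the question.
const src = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121";

// \w accepts digits and underscores, so "-1212121" is matched as a signed "word".
const withWordClass = src.match(/(-|\+)?\w+(\^\.?\d+)?/g);

// [a-zA-Z] accepts letters only, so the digits-only token is skipped.
const withLetters = src.match(/(-|\+)?[a-zA-Z]+(\^\.?\d+)?/g);

console.log(withWordClass.includes("-1212121")); // true
console.log(withLetters.includes("-1212121"));   // false
```

Note that both versions still pick up the bare "hello" out of "hello+" and "hello^-1212121", since nothing yet checks what follows the word.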

And finally, to skip entirely the strings that should not be captured (rather than matching part of them), you'll have to use lookarounds. I don't know of any other way to check the delimiters without hindering the matches:

(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
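To see what the lookarounds buy you, here is a small sketch in JavaScript (assuming an engine with lookbehind support, such as a recent Node.js; .NET supports it too). It reads capture group 1 for each match on the question's test string:

```javascript
// Test string taken from the question.
const src = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121";

// The lookbehind requires start-of-input or a comma before the term,
// the lookahead requires a comma or end-of-input after it.
const re = /(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)/g;

// Group 1 holds the term without the surrounding whitespace.
const terms = Array.from(src.matchAll(re), (m) => m[1]);

console.log(terms);
// [ 'hello', 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25' ]
```

"hello+" and "hello^-1212121" are now rejected as a whole, because the text after the word is not a delimiter.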

EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:

(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
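A minimal sketch of this variant in JavaScript, on a made-up sample string (not from the question) that includes hello^9.8 and the disallowed -hello^2:

```javascript
// Hypothetical sample input, chosen to exercise both alternatives.
const sample = "hello^9.8,-hello^2, +hello ,hello^.25";

// Either a (signed) plain word, or an unsigned word with a ^number suffix.
const re = /(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))/g;

const terms = Array.from(sample.matchAll(re), (m) => m[1]);

console.log(terms);
// [ 'hello^9.8', '+hello', 'hello^.25' ]
```

The signed-plus-exponent form -hello^2 matches neither alternative, so it is skipped entirely.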

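And lastly, if capturing the words is sufficient, we can remove the lookarounds. The exponent alternative then has to come first, otherwise the plain-word alternative always wins and the ^ part is never captured:

([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)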
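Without a lookahead to force backtracking, the order of the alternatives decides which branch wins. A small JavaScript sketch comparing the two orderings on a made-up sample string (the variable names are only illustrative):

```javascript
// Hypothetical sample input.
const sample = "hello^.555,-hello";

// Plain-word alternative first: it succeeds on "hello" and the ^.555 is left behind.
const wordFirst = sample.match(/([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)/g);

// Exponent alternative first: the longer form is tried before the plain word.
const exponentFirst = sample.match(/([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)/g);

console.log(wordFirst);     // [ 'hello', '-hello' ]
console.log(exponentFirst); // [ 'hello^.555', '-hello' ]
```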

Upvotes: 2

Rob Raisch

Reputation: 17357

It would be better if you first state what it is you are looking to extract.

You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...

Assuming you want to capture only:

  • words that have a leading + or -
  • words that have a trailing ^ followed by an optional period followed by one or more digits

and that words are sequences of one or more letters

I'd use:

([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)

which breaks down into:

(              # start capture group
    [a-zA-Z]+    # one or more letters - note \w would also match digits and underscores
    \^           # literal caret
    \.?          # optional period
    \d+          # one or more digits
|              # OR
    [-+]         # literal plus or minus
    [a-zA-Z]+    # one or more letters
)              # end of capture group
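As a rough check, here is that expression run in JavaScript against the test string from the question:

```javascript
// Test string taken from the question.
const src = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121";

// Word with a ^number suffix, or word with a mandatory leading sign.
const re = /([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)/g;

const found = src.match(re);

console.log(found);
// [ 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25' ]
```

Plain words such as the first "hello" are not matched, since the sign is mandatory in the second alternative.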

EDIT

To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:

([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)

which breaks down into:

(              # start capture group
    [+-]         # literal plus or minus
    [a-zA-Z]+    # one or more letters - note \w matches numbers and underscores
|              # OR
    [a-zA-Z]+    # one or more letters
    \^           # literal caret
    (?:          # start of non-capturing group
      \.           # literal period
      \d+          # one or more digits
    |            # OR
      \d+          # one or more digits       
      \.           # literal period
      \d+          # one or more digits
    |            # OR
      \d+          # one or more digits 
    )            # end of non-capturing group
|              # OR
    [a-zA-Z]+    # one or more letters
)              # end of capture group

Also note that, per your updated requirements, after the ^ this regexp accepts both ordinary non-negative numbers (e.g. 0, 1, 1.2, 1.23) and those lacking a leading digit (e.g. .1, .12).
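A short JavaScript sketch of the number forms this accepts, using a made-up sample string:

```javascript
// Hypothetical sample input covering each alternative.
const sample = "hello,+hello,hello^7,hello^1.23,hello^.12";

const re = /([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)/g;

const found = sample.match(re);

console.log(found);
// [ 'hello', '+hello', 'hello^7', 'hello^1.23', 'hello^.12' ]
```

The plain-word alternative is deliberately last, so that the longer ^number form is tried before it.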

FURTHER EDIT

This regexp will only match the following patterns delimited by commas:

  • word
  • word with leading plus or minus
  • word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+

    ([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)

Please note that the useful match will appear in the first capture group, not the entire match.

So, in JavaScript, you'd write:

var src="hello ,  hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE=/([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g;

var m;
// exec() returns each match in turn; m[1] is the first capture group
while((m=RE.exec(src))!==null){
    console.log(m[1])
}

which produces:

hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1

Upvotes: 1
