Reputation: 2378
I am trying to write a regular expression that can extract different types of string+number+symbol combinations out of a string. The types of strings I am trying to extract are:
avs-tldr-02
cc+asede
x86_64
The edge cases I am testing are these strings appearing at the beginning, middle, and end of sentences:
avs-tldr-02 this is a test
cc+asede this is a test
x86_64 this is a test
this is a test avs-tldr-02 this is a test
this is a test cc+asede this is a test
this is a test x86_64 this is a test
this is a test avs-tldr-02
this is a test cc+asede
this is a test x86_64
Based on this excellent answer, I have dabbled around with "lookaround" assertions in RegEx and have come up with the following:
(?=.*[:alnum:])(?=.*[:punct:])([a-zA-Z0-9_-]+)
However, this keeps matching the first word of the string. I understand why this is happening, but I am at a loss as to how to tweak it for my use case.
How do I modify this to get what I am looking for? Alternatively, are there other ways to tackle this issue?
Upvotes: 2
Views: 1119
Reputation: 2855
I used this regex
/([^\s]+?[-_+][^\s]+)/g
I'm not familiar with R, but the regex tested fine: https://regex101.com/r/Sxully/1
Note: when you put this regex inside a "" or '' string literal, be careful with backslashes (\ versus \\); the exact escaping depends on the language and usage.
If you also want to accept something like '_word_starting_by_underline', use this regex instead (it probably won't be useful :) ):
/([^\s]*?[-_+][^\s]+)/g
// ^^^^ + changed to * to support nothing before [-_+]
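For what it's worth, here is a minimal sketch of how the first regex could be checked outside regex101. It uses Python's re module rather than R (my assumption, purely for illustration), and the extract helper name is made up; the sample sentences come from the question.

```python
import re

# Same pattern as the first regex above, without the /.../g delimiters:
# a lazy run of non-whitespace, one of - _ +, then a greedy run of non-whitespace.
TOKEN = re.compile(r"[^\s]+?[-_+][^\s]+")

def extract(sentence):
    """Return each whitespace-delimited token that has -, _ or + with at
    least one character on both sides of it (hypothetical helper)."""
    return TOKEN.findall(sentence)

samples = [
    "avs-tldr-02 this is a test",
    "this is a test cc+asede this is a test",
    "this is a test x86_64",
]

for s in samples:
    print(extract(s))
```

In an R string literal the \s would presumably need to be written as \\s, which is the backslash caveat mentioned above.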
Upvotes: 3
Reputation: 626738
Your pattern has several issues. POSIX character classes like [:alnum:] or [:punct:] must be located inside bracket expressions to be parsed as such. Another thing is that the .* matches any char (other than line break chars in a PCRE regex), and thus will cause overmatching, as the lookahead will return true even if its pattern is found much farther in the string than you expect.
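To make the overmatching concrete, here is a rough sketch in Python (not R, which is an assumption made only so the example is easy to run). Python's re module has no POSIX bracket classes, so ALNUM and PUNCT below are ASCII stand-ins that I am assuming are close enough for the sample strings.

```python
import re
import string

# ASCII stand-ins for [[:alnum:]] and [[:punct:]] (assumption: ASCII-only input).
ALNUM = "[A-Za-z0-9]"
PUNCT = "[" + re.escape(string.punctuation) + "]"

# The question's idea with the classes repaired but the .* left in the lookaheads.
loose = re.compile(rf"(?=.*{ALNUM})(?=.*{PUNCT})[a-zA-Z0-9_-]+")

text = "this is a test avs-tldr-02"
first = loose.search(text).group(0)
print(first)  # this
```

Both lookaheads succeed at position 0 because .* lets them find an alphanumeric char and the hyphen anywhere later in the string, so the consuming part happily matches the first word.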
I suggest using
(?=[[:punct:]]*[[:alnum:]])(?=[[:alnum:]]*[[:punct:]])[[:alnum:][:punct:]]+
See the regex demo
Details:
(?=[[:punct:]]*[[:alnum:]]) - at the current position, there must be 0+ punctuation symbols followed by an alphanumeric char
(?=[[:alnum:]]*[[:punct:]]) - at the current position (same as above, since lookaheads are zero-width assertions that do not consume text), there must be 0+ alphanumeric chars followed by a punctuation symbol
[[:alnum:][:punct:]]+ - 1 or more alphanumeric or punctuation chars.
You may add a word boundary (\b) on both ends if you require an alphanumeric char at the start/end of the match.
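As a sanity check of the suggested pattern, here is a small sketch in Python rather than R (an assumption for illustration only). Since Python's re module does not support [[:alnum:]] or [[:punct:]], the classes are rebuilt from ASCII ranges and string.punctuation; the find_tokens name is hypothetical.

```python
import re
import string

# ASCII approximations of the POSIX classes (assumption: ASCII-only input).
ALNUM = "A-Za-z0-9"
PUNCT = re.escape(string.punctuation)

PATTERN = re.compile(
    rf"(?=[{PUNCT}]*[{ALNUM}])"  # 0+ punctuation chars, then an alphanumeric
    rf"(?=[{ALNUM}]*[{PUNCT}])"  # 0+ alphanumerics, then a punctuation char
    rf"[{ALNUM}{PUNCT}]+"        # consume the alphanumeric/punctuation run
)

def find_tokens(sentence):
    """Return all runs that satisfy both lookaheads (hypothetical helper)."""
    return PATTERN.findall(sentence)

for s in [
    "avs-tldr-02 this is a test",
    "this is a test cc+asede this is a test",
    "this is a test x86_64",
]:
    print(find_tokens(s))
```

Because the lookaheads only allow alphanumeric or punctuation chars before the char they require, they cannot skip over whitespace, so plain words such as "this" are rejected. Note that a word directly followed by sentence punctuation, such as "test.", would still match.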
Upvotes: 2