sriramn

Reputation: 2378

RegEx to extract alphanumeric+symbol combinations from a string

I am trying to write a regular expression that can extract different types of string+number+symbol combinations out of a string. The types of strings I am trying to extract are:

avs-tldr-02
cc+asede
x86_64

The edge cases I am testing are these strings appearing at the beginning, middle and end of sentences:

avs-tldr-02 this is a test
cc+asede this is a test
x86_64 this is a test

this is a test avs-tldr-02 this is a test
this is a test cc+asede this is a test
this is a test x86_64 this is a test

this is a test avs-tldr-02
this is a test cc+asede
this is a test x86_64

Based on this excellent answer, I have dabbled around with "lookaround" assertions in RegEx and have come up with the following:

(?=.*[:alnum:])(?=.*[:punct:])([a-zA-Z0-9_-]+)

However, this keeps matching the first word of the string. I understand why this is happening, but I am at a loss as to how to tweak it to work for my use case.

How do I modify this to get what I am looking for, or are there other ways to tackle this issue?

Upvotes: 2

Views: 1119

Answers (2)

MohaMad

Reputation: 2855

I used this regex

/([^\s]+?[-_+][^\s]+)/g

I'm not familiar with R, but the regex tests well: https://regex101.com/r/Sxully/1

Note: when you put this regex inside "" or '' quotes, be careful with the backslash; whether \s has to be written as \\s depends on the language and usage.

If you also want to accept tokens that begin with the symbol, such as '_word_starting_by_underline', use this regex instead (it probably won't be useful :) ):

/([^\s]*?[-_+][^\s]+)/g
//    ^^^^ + changed to * to support nothing before [-_+]
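
For anyone who wants to try the first pattern outside regex101, here is a minimal sketch in Python. Python is only an assumption for illustration (the question is about another environment), and the names TOKEN and extract_tokens are made up for this example. The raw string r"..." is what keeps \s intact, which is the quoting concern mentioned in the note above.

```python
import re

# First pattern from this answer, written as a raw string so that the
# backslash in \s reaches the regex engine unchanged.
TOKEN = re.compile(r"[^\s]+?[-_+][^\s]+")


def extract_tokens(text):
    """Return the non-whitespace runs that contain -, _ or + with at
    least one character before and after the symbol."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    samples = [
        "avs-tldr-02 this is a test",
        "this is a test cc+asede this is a test",
        "this is a test x86_64",
    ]
    for sample in samples:
        print(sample, "->", extract_tokens(sample))
```

With the sample sentences from the question, each call should return a one-element list holding the token, and a sentence made only of plain words should return an empty list.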

Upvotes: 3

Wiktor Stribiżew

Reputation: 626738

Your pattern has several issues. POSIX character classes like [:alnum:] or [:punct:] must be located inside bracket expressions to be parsed as such. Another issue is that .* matches any char (other than a line break char in a PCRE regex), and thus causes overmatching: the lookahead succeeds even if its pattern is found much farther along in the string than you expect.
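
Both issues can be reproduced in a short sketch. It uses Python's re module purely for illustration, which is an assumption on my part: the question targets a different engine, and details vary between engines. Python's re has no POSIX classes at all, but it reads a bare [:alnum:] the way described above, as an ordinary set of the literal characters ':', 'a', 'l', 'n', 'u' and 'm'. The names BARE and ORIGINAL are made up for this example.

```python
import re

# Outside a bracket expression, [:alnum:] is just a character set
# holding ':', 'a', 'l', 'n', 'u' and 'm'.
BARE = re.compile(r"[:alnum:]")
print(bool(BARE.fullmatch("m")))  # 'm' is one of the listed characters
print(bool(BARE.fullmatch("7")))  # a digit is not in that set

# The pattern from the question. Each .* lets its lookahead succeed as
# long as one of the listed letters occurs anywhere later in the string,
# so the lookaheads do not constrain the word matched at the start.
ORIGINAL = re.compile(r"(?=.*[:alnum:])(?=.*[:punct:])([a-zA-Z0-9_-]+)")
first = ORIGINAL.search("this is a test avs-tldr-02 this is a test")
print(first.group(1))
```

The last line prints the first word of the sentence rather than avs-tldr-02, which is the behaviour reported in the question.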

I suggest using

(?=[[:punct:]]*[[:alnum:]])(?=[[:alnum:]]*[[:punct:]])[[:alnum:][:punct:]]+

See the regex demo

Details:

  • (?=[[:punct:]]*[[:alnum:]]) - at the current position, there must be 0+ punctuation symbols followed with an alphanumeric char
  • (?=[[:alnum:]]*[[:punct:]]) - at the current position (same as above, lookaheads are zero-width assertions that do not consume text), there must be 0+ alphanumeric chars followed with a punctuation symbol
  • [[:alnum:][:punct:]]+ - 1 or more alphanumeric or punctuation chars.

You may add a word boundary (\b) on both ends if you require an alphanumeric char at the start/end of the match.
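
A rough way to experiment with this pattern in Python is sketched below. Python's re module does not support POSIX bracket classes, so the sketch substitutes ASCII stand-ins: A-Za-z0-9 for [:alnum:] and string.punctuation for [:punct:]. Those substitutions, and the names ALNUM, PUNCT, MIXED, BOUNDED and find_mixed, are assumptions for illustration and not part of the original pattern.

```python
import re
import string

# ASCII stand-ins (assumptions) for the POSIX classes used above.
ALNUM = "A-Za-z0-9"                    # approximates [:alnum:]
PUNCT = re.escape(string.punctuation)  # approximates [:punct:]

# Same structure as the suggested pattern: two anchored lookaheads,
# then one or more alphanumeric or punctuation characters.
MIXED = re.compile(
    rf"(?=[{PUNCT}]*[{ALNUM}])"   # some alphanumeric after optional punctuation
    rf"(?=[{ALNUM}]*[{PUNCT}])"   # some punctuation after optional alphanumerics
    rf"[{ALNUM}{PUNCT}]+"
)

# Variant with a word boundary on both ends.
BOUNDED = re.compile(rf"\b(?:{MIXED.pattern})\b")


def find_mixed(text, bounded=False):
    """Return all matches, optionally using the \\b-wrapped variant."""
    pattern = BOUNDED if bounded else MIXED
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_mixed("this is a test avs-tldr-02 this is a test"))
    print(find_mixed("see (x86_64), ok"))
    print(find_mixed("see (x86_64), ok", bounded=True))
```

In this sketch the plain pattern returns (x86_64), for the second sentence, while the bounded variant trims it to x86_64. Ordinary sentence punctuation is also in the punctuation class, so a word followed directly by a comma satisfies both lookaheads as well; filter such matches if they are unwanted.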

Upvotes: 2
