Bhaskar

Reputation: 683

Get only words that start and end with letters or digits using a regular expression

I am trying to extract words from a sentence using a regular expression, with the following condition: each word must start and end with a letter or a digit.

For example, I am trying it on this sentence:

>> text = 'This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$'
>> re.findall(r'[\w(\S)\d]{2,}', text)   # the pattern I applied

But the result is

>> ['This',
    'product',
    'labelled',
    'as',
    '4KJ2',
    'with',
    'manufactured',
    'in',
    'P&G',
    'in',
    'the',
    'year',
    '1990',
    'symbols',
    '$$$,',
    '$J2,',
    '2J$']

In the above output

$$$, $J2, 2J$

are not wanted in the output. I also tried the pattern below, but it didn't work:

>> re.findall(r'^[a-zA-Z0-9][\S]*[a-zA-Z0-9]$', text)
>> [] # empty output

-Thanks

Upvotes: 0

Views: 1476

Answers (4)

Night Train

Reputation: 2576

You could split the string with "some text".split() before applying a regex, which makes things easier, e.g.

text = 'This product labelled as 4KJ2 with manufactured in P&G '\
       'in the year 1990 : symbols $$$, $J2, 2J$ $abc def g'

[x for x in text.split() if re.match(r"^\w(?:.*\w)?$", x)]

Here is a fairly compact regex, but it misses a valid word at the very end of the sentence:

re.findall(r'\b(\w\S*?\w)(?=\s)', text)

This one works, but I guess the list comprehension is the better choice.

re.findall(r'(?:\s|\A)(\w(?:\S*\w)?)(?=\s|$|[.,:;?!])', text)

\b matches a word boundary

(?=\s) is a positive lookahead, which will check for a space to follow the match without including it.

As @Toto just pointed out, \w matches word characters, which include digits (and the underscore).

Here is my example on regex101.com as well.
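
The difference between the three variants is easiest to see on a short input. The following is only a sketch: the shortened sample string and the variable names are assumptions made for illustration.

```python
import re

# Hypothetical, shortened sample that ends in a valid one-character word.
sample = 'symbols $$$, $J2, 2J$ $abc def g'

# Lazy pattern with a whitespace lookahead.
lazy_matches = re.findall(r'\b(\w\S*?\w)(?=\s)', sample)

# Pattern that also accepts end of string or closing punctuation.
anchored_matches = re.findall(
    r'(?:\s|\A)(\w(?:\S*\w)?)(?=\s|$|[.,:;?!])', sample)

# Split first, then test each token on its own.
split_matches = [x for x in sample.split()
                 if re.match(r'^\w(?:.*\w)?$', x)]

print(lazy_matches)      # ['symbols', 'abc', 'def']
print(anchored_matches)  # ['symbols', 'def', 'g']
print(split_matches)     # ['symbols', 'def', 'g']
```

On this input the \b variant also picks up abc from $abc, because \b only requires a transition between a word and a non-word character, and it skips the trailing g because no whitespace follows it.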

Upvotes: 1

Jan

Reputation: 43169

A non-regex solution could be

text = "This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$"

def tester(word):
    # Keep the word only if its first and last characters are alphanumeric.
    return word[:1].isalnum() and word[-1].isalnum()

words = [word for word in text.split() if tester(word)]
print(words)

This yields

['This', 'product', 'labelled', 'as', '4KJ2', 'with', 'manufactured', 'in', 'P&G', 'in', 'the', 'year', '1990', 'symbols']
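
One thing to be aware of: str.isalnum() is Unicode-aware, so accented letters and non-Latin digits also count as alphanumeric. If only ASCII letters and digits should qualify, a stricter check could look roughly like this; the helper name is_ascii_word and the sample string are hypothetical.

```python
import string

# ASCII letters and digits only (assumption: this is the intended alphabet).
ASCII_ALNUM = set(string.ascii_letters + string.digits)

def is_ascii_word(word):
    # True when the first and last characters are ASCII letters or digits.
    return bool(word) and word[0] in ASCII_ALNUM and word[-1] in ASCII_ALNUM

# Hypothetical sample containing one accented word ("\u00e9t\u00e9").
sample = 'P&G 2J$ $J2, \u00e9t\u00e9 1990'

ascii_words = [w for w in sample.split() if is_ascii_word(w)]
unicode_words = [w for w in sample.split()
                 if w[0].isalnum() and w[-1].isalnum()]

print(ascii_words)    # ['P&G', '1990']
print(unicode_words)  # the accented word is kept here as well
```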

Upvotes: 3

edd

Reputation: 1417

Try the following

(?<=[\s^])[A-Za-z0-9]+

At least it works on regex101.com.

It is meant to look behind for either whitespace or the beginning of the sentence and match letters or digits explicitly.
It should work with \w too, but it's hard to use that site on a phone.
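
Running the pattern through Python's re module on a shortened, hypothetical sample suggests two caveats. Inside a character class, ^ is a literal caret rather than a start-of-string anchor, so the very first word is skipped, and nothing in the pattern constrains how a match ends. A minimal check:

```python
import re

# Hypothetical, shortened sample; the variable names are illustrative only.
sample = 'This 4KJ2 P&G $J2, 2J$'

matches = re.findall(r'(?<=[\s^])[A-Za-z0-9]+', sample)

# 'This' is missing (no whitespace before it), 'P&G' is cut at '&',
# and '2J' is taken from '2J$' because the end of the word is unchecked.
print(matches)  # ['4KJ2', 'P', '2J']
```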

Upvotes: 0

The fourth bird

Reputation: 163372

You could use

(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)

Note that the minimum length is 2 characters.

If you also want to match a single character, you could use an optional non-capturing group after matching the first character:

(?<!\S)[a-zA-Z0-9](?:\S*[a-zA-Z0-9])?(?!\S)
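
To illustrate how the two patterns differ, here is a small sketch on a hypothetical sample that contains one-character words; the sample string and variable names are assumptions.

```python
import re

# Hypothetical sample with one-character words at both ends.
sample = 'g is P&G $J2, 2J$ 7'

# Requires at least two characters.
two_or_more = re.findall(r'(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)', sample)

# The optional group also admits a single letter or digit.
one_or_more = re.findall(
    r'(?<!\S)[a-zA-Z0-9](?:\S*[a-zA-Z0-9])?(?!\S)', sample)

print(two_or_more)  # ['is', 'P&G']
print(one_or_more)  # ['g', 'is', 'P&G', '7']
```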

For example

import re

text = 'This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$'
result = re.findall(r'(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)', text)
print(result)

Output

['This', 'product', 'labelled', 'as', '4KJ2', 'with', 'manufactured', 'in', 'P&G', 'in', 'the', 'year', '1990', 'symbols']

Upvotes: 4
