Reputation: 683
I am trying to extract words from a sentence using a regular expression, with the following condition.
For example, I am trying it on this sentence:
>> text = 'This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$'
>> re.findall(r'[\w(\S)\d]{2,}', text) # the pattern I applied
But the result is
>> ['This',
'product',
'labelled',
'as',
'4KJ2',
'with',
'manufactured',
'in',
'P&G',
'in',
'the',
'year',
'1990',
'symbols',
'$$$,',
'$J2,',
'2J$']
In the above output,
$$$, $J2, 2J$
are not desired words. I have also tried the pattern below, but it didn't work:
>> re.findall(r'^[a-zA-Z0-9][\S]*[a-zA-Z0-9]$', text)
>> [] # empty output
-Thanks
Upvotes: 0
Views: 1476
Reputation: 2576
You could split the string with "some text".split() before applying a regex,
to make things easier, e.g.
import re
text = 'This product labelled as 4KJ2 with manufactured in P&G '\
'in the year 1990 : symbols $$$, $J2, 2J$ $abc def g'
# raw string, so \w reaches the regex engine unchanged
[x for x in text.split() if re.match(r"^\w(?:.*\w)?$", x)]
Here is a quite compact regex, but it doesn't work for a valid string at the end of a sentence:
re.findall(r'\b(\w\S*?\w)(?=\s)', text)
This one will work, but I guess the list comprehension will work much better.
re.findall(r'(?:\s|\A)(\w(?:\S*\w)?)(?=\s|$|[.,:;?!])', text)
\b matches a word boundary.
(?=\s) is a positive lookahead, which will check for a space to follow the match without including it.
As @Toto just pointed out, \w will match word characters including digits.
Here is my example on regex101.com as well.
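To make the three building blocks above concrete, here is a small sketch using Python's re module. The sample strings and variable names are made up for illustration; they are not from the question.

```python
import re

# \w matches letters, digits and the underscore, but not symbols like $ or &.
word_chars = re.findall(r'\w', 'a1_$&')

# (?=\s) only checks that whitespace follows; the whitespace is not part of the match.
# The last word has no whitespace after it, so it is not returned.
before_space = re.findall(r'\w+(?=\s)', 'foo bar baz')

# \b is a zero-width position between a word character and a non-word character,
# so a match can also start right after a symbol such as $.
after_boundary = re.findall(r'\b\w+', '$J2, 2J$')

print(word_chars)
print(before_space)
print(after_boundary)
```

The last line hints at why \b alone is not enough here: it happily starts a match inside $J2, which is why the longer pattern checks for whitespace or the start of the string instead.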
Upvotes: 1
Reputation: 43169
A non-regex solution could be
text = "This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$"
def tester(word):
    # keep a word only if its first and last characters are alphanumeric
    if word[:1].isalnum() and word[-1].isalnum():
        return True
    return False
words = [word for word in text.split() if tester(word)]
print(words)
This yields
['This', 'product', 'labelled', 'as', '4KJ2', 'with', 'manufactured', 'in', 'P&G', 'in', 'the', 'year', '1990', 'symbols']
Upvotes: 3
Reputation: 1417
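Try the following
(?:^|(?<=\s))[A-Za-z0-9]+
At least it works on regex101.com.
It should match at the beginning of the string or look behind for white space, and then match letters or numbers explicitly. The ^ has to sit outside the character class: inside [...] it is either a negation (in first position) or a literal caret, never a start anchor.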
It should work with \w, too, but it's hard to use that site on a phone.
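As a quick check of how ^ behaves inside and outside a character class, here is a minimal sketch with Python's re module. The sample string is hypothetical and chosen only to expose the difference.

```python
import re

# Made-up sample: a leading word, a word with a symbol inside,
# a word preceded by a literal caret, and a word ending in a symbol.
sample = 'This P&G ^caret 2J$'

# Inside [...] a ^ that is not the first character is a literal caret,
# so this lookbehind needs whitespace or a '^' character before the match.
in_class = re.findall(r'(?<=[\s^])[A-Za-z0-9]+', sample)

# With the anchor outside the class, the first word of the string is found too.
anchored = re.findall(r'(?:^|(?<=\s))[A-Za-z0-9]+', sample)

print(in_class)
print(anchored)
```

Either way, this approach only constrains where a match starts, so P&G comes back as P and 2J$ as 2J. If the end of the word matters as well, the match also needs a check on what follows it.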
Upvotes: 0
Reputation: 163372
You could use
(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)
Note that the minimum length is 2 characters.
If you also want to match a single char, you could use an optional non capturing group after matching the first char:
(?<!\S)[a-zA-Z0-9](?:\S*[a-zA-Z0-9])?(?!\S)
For example
import re
text = 'This product labelled as 4KJ2 with manufactured in P&G in the year 1990 : symbols $$$, $J2, 2J$'
result = re.findall(r'(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)', text)
print(result)
Output
['This', 'product', 'labelled', 'as', '4KJ2', 'with', 'manufactured', 'in', 'P&G', 'in', 'the', 'year', '1990', 'symbols']
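To see the difference between the two patterns, here is a small comparison on a made-up sample that contains one-character words (the sample string and variable names are only for illustration):

```python
import re

# Hypothetical sample with single-character tokens and symbol-wrapped tokens.
sample = 'a 1 $ ab $x x$ P&G'

# Requires an alphanumeric first AND last character, so at least 2 characters.
two_or_more = re.findall(r'(?<!\S)[a-zA-Z0-9]\S*[a-zA-Z0-9](?!\S)', sample)

# The optional group lets a lone alphanumeric character match as well.
one_or_more = re.findall(r'(?<!\S)[a-zA-Z0-9](?:\S*[a-zA-Z0-9])?(?!\S)', sample)

print(two_or_more)
print(one_or_more)
```

In both cases (?<!\S) and (?!\S) act as whitespace boundaries, so tokens like $x and x$ are rejected as a whole instead of being matched partially.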
Upvotes: 4