Reputation: 3
Sample Input:
'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
Expected Output:
['D3H6', 'X30G', 'Y2A', '12H89']
My code:
import re
# note holds the sample input string shown above
split_note = re.split(r'[.;,\s]\s*', note)
pattern = re.compile("^[a-zA-Z0-9]+$")
alphaList = []
for a in split_note:
    if pattern.match(a):
        alphaList.append(a)
I need to extract all the alphanumeric words (those mixing letters and digits) from a split string and store them in a list.
The code above does not give the expected output.
Upvotes: 0
Views: 75
Reputation: 411
Maybe this can solve the problem:
import re
# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# word tokenization
split = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", stri)
# this statement keeps only the words containing both letters and digits
print([word for word in split if re.match(r'^(?=.*[a-zA-Z])(?=.*[0-9])', word)])
# output: ['D3H6', 'X30', 'Y2', '12H89']
Upvotes: 1
Reputation: 1105
^ and $ anchor the beginning and end of the string (or of a line in multiline mode), not of a word.
Besides, your example words don't include lowercase letters, so why add a-z?
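To make the point about the anchors and the a-z range concrete, here is a minimal sketch. It assumes the question's pattern is applied to each token separately, as in the question's loop; the variable names tokens, anchored and matched are only illustrative. In that setup ^ and $ wrap the whole token, so every token made purely of letters or digits passes, including plain words such as Part or model:

```python
import re

note = "note - Part model D3H6 with specifications X30G and Y2A is having features 12H89."

# Same split as in the question: break on '.', ';', ',' or whitespace.
tokens = re.split(r"[.;,\s]\s*", note)

# The question's pattern: any token made only of letters and digits.
anchored = re.compile(r"^[a-zA-Z0-9]+$")
matched = [t for t in tokens if anchored.match(t)]

# Plain words pass the filter too, which is why the result is too broad.
print(matched)
```

Only the lone hyphen and the empty string left after the final period are filtered out, so the pattern needs an extra condition that requires both a letter and a digit.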
Considering your example, if what you need is to fetch a word that always contains at least one letter and at least one digit and always ends with a digit, this is the pattern (note that it also matches purely numeric words of two or more digits):
\b[0-9A-Z]+\d+\b
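If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex:
\b(?=[0-9A-Z]*\d)(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b
\b stands for a word boundary, and each (?=...) is a lookahead that checks for a digit or a letter without consuming any characters.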
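As a rough check, both cases can be run against the question's sample input with re.findall. This is only a sketch: the first pattern is the one given above for words ending in a digit, and the second is a lookahead-based way of requiring at least one digit and one uppercase letter. The variable names are illustrative.

```python
import re

note = "note - Part model D3H6 with specifications X30G and Y2A is having features 12H89."

# Case 1: uppercase letters and digits, and the word must end with a digit.
ends_with_digit = re.findall(r"\b[0-9A-Z]+\d+\b", note)

# Case 2: lookaheads require at least one digit and at least one uppercase
# letter somewhere in the word; the word may end with either.
mixed = re.findall(r"\b(?=[0-9A-Z]*\d)(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b", note)

print(ends_with_digit)  # ['D3H6', '12H89']
print(mixed)            # ['D3H6', 'X30G', 'Y2A', '12H89']
```

Because findall scans the whole string, no prior split is needed; the word boundaries do the tokenizing.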
Upvotes: 0