BP09

Reputation: 3

Extract words from a string

Sample Input:

'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'

Expected Output:

['D3H6', 'X30G', 'Y2A', '12H89']

My code:

import re

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
alphaList = []
split_note = re.split(r'[.;,\s]\s*', note)
pattern = re.compile("^[a-zA-Z0-9]+$")
for a in split_note:
    if pattern.match(a):
        alphaList.append(a)

I need to extract all the alphanumeric words (those containing both letters and digits) from the split string and store them in a list.

The code above does not give the expected output.

Upvotes: 0

Views: 75

Answers (2)

Mazziotti Raffaele

Reputation: 411

Maybe this can solve the problem:

import re

# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# word tokenization
split = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", stri)
# this statement returns the words containing both letters and digits
print([word for word in split if re.match(r'^(?=.*[a-zA-Z])(?=.*[0-9])', word)])

# output: ['D3H6', 'X30', 'Y2', '12H89']
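
The same filtering idea can be tried on the exact sample from the question. This is only a minimal sketch: the function name `mixed_words` and the simplified `\w+` tokenization are assumptions for illustration, not part of the code above.

```python
import re

def mixed_words(text):
    """Return tokens that contain at least one letter and at least one digit."""
    # Simplified tokenization (an assumption): runs of word characters only.
    tokens = re.findall(r'\w+', text)
    # Two lookaheads from the start of the token: a letter and a digit, in any order.
    has_both = re.compile(r'(?=.*[a-zA-Z])(?=.*[0-9])')
    return [t for t in tokens if has_both.match(t)]

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
print(mixed_words(note))  # ['D3H6', 'X30G', 'Y2A', '12H89']
```

Note that `\w` also accepts underscores, so a token such as `a_1` would pass this filter.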

Upvotes: 1

hyamanieu

Reputation: 1105

^ and $ anchor the beginning and end of a string (or of a line in multiline mode), not of a word. Besides, your example words don't include lower case, so why add a-z?
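
To see what the question's pattern actually accepts, here is a minimal sketch reusing its split and its regex (the names `tokens` and `matched` are mine). Every token made only of letters and/or digits passes, so plain words such as `note` and `Part` end up in the list as well.

```python
import re

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'

# Same split and same pattern as in the question.
tokens = re.split(r'[.;,\s]\s*', note)
pattern = re.compile(r'^[a-zA-Z0-9]+$')

# Only the lone '-' and the trailing empty string are rejected.
matched = [t for t in tokens if pattern.match(t)]
print(matched)
```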

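Considering your example, if what you need is to fetch a word that always contains at least one letter and at least one digit and always ends with a digit, this pattern is a starting point (on its own it does not enforce the letter, so a token of two or more digits would also match):

\b[0-9A-Z]+\d+\b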

If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex, because each requirement needs its own lookahead:

\b(?=[0-9A-Z]*\d)(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b

\b stands for a word boundary.
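
As a rough check, the word-boundary approach can be run with `re.findall` on the sample from the question. This is a sketch under assumptions: the lookahead-based pattern and the name `MIXED_TOKEN` are illustrative, and the pattern accepts only uppercase letters and digits.

```python
import re

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'

# \b marks the word boundaries; the two lookaheads require at least one digit
# and at least one uppercase letter somewhere in the token.
MIXED_TOKEN = re.compile(r'\b(?=[0-9A-Z]*\d)(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b')

codes = MIXED_TOKEN.findall(note)
print(codes)  # ['D3H6', 'X30G', 'Y2A', '12H89']
```

Because `findall` scans the whole string, no prior split is needed, and digit-only or letter-only tokens are skipped.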

Upvotes: 0
