Reputation: 340
I need to extract a string from a document using a regex pattern in Python. The string always starts with either "AK" or "BK", followed by digits, letters, "-" or "/" in any order. The string can appear anywhere in the document.
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written the following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but the resulting list also contains the letter-only AK and BK words, something like this:
res_list=['AKBN','BKCPU','AK3418CPMP']
When I just use
res_grp=re.search(pattern,document_text)
# group(0) is the whole match; the pattern has no capturing group 1
res=res_grp.group(0)
I just get 'AKBN'.
When I use findall, it also matches the words "AKBN" and "BKCPU" along with the required "AK3418CPMP".
I want to extract only one string, "AK3418CPMP". How can I do that?
Upvotes: 1
Views: 451
Reputation: 83
You can include a 'match at least' clause, which requires at least one letter followed by at least one digit (or the reverse) after the prefix:
([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,})
This would cover your 1st and 2nd needs. You can customize this regex to handle the '-' and '/' cases too.
Suppose you would like to handle cases where a '-' or '/' separates your substrings:
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})
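As a rough check, the two patterns above can be run against the question's sample text. This is only a sketch: the helper `full_matches`, the variable names, and the sample string containing separators are made up for illustration and are not part of the original answer.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# Pattern from the answer: letters then digits, or digits then letters.
basic = re.compile(r"([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,})")

# Variant from the answer: one optional '-' or '/' before each run.
with_separators = re.compile(
    r"([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})"
    r"|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})"
)


def full_matches(regex, text):
    # Both patterns contain capturing groups, so re.findall would return
    # tuples of groups; finditer with group(0) yields the whole match.
    return [m.group(0) for m in regex.finditer(text)]


basic_result = full_matches(basic, document_text)
# Hypothetical input with separators, not taken from the question.
separator_result = full_matches(with_separators, "code AK-3418/CPMP end")

print(basic_result)      # ['AK3418CPMP']
print(separator_result)  # ['AK-3418/CPMP']
```

Note that both patterns only accept a single run of letters and a single run of digits, so a string such as "AK12AB34" would be matched only partially.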
Upvotes: 0
Reputation: 2227
You can keep your regex and let Python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
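res_list=[x for x in
          re.findall(pattern,document_text)
          # keep only the matches that contain at least one digit
          if re.search(r'\d', x)
          # \w is always satisfied here, since every match starts with AK or BK
          and re.search(r'\w', x)]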
print(res_list)
Upvotes: 1
Reputation: 163642
You can make sure to match at least a single digit after matching AK or BK, and move the - to the end of the character class, where it cannot be mistaken for a range operator.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
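A minimal sketch of using this pattern from Python on the question's sample text (the variable names are illustrative). Because the pattern has no capturing groups, `re.findall` returns the full matched strings.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# Require at least one digit somewhere after the AK/BK prefix.
pattern = re.compile(r"\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*")

matches = pattern.findall(document_text)
print(matches)  # ['AK3418CPMP']
```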
\b : A word boundary to prevent a partial match
[AB]K : Match either AK or BK
[A-Za-z/-]* : Optionally repeat matching chars A-Za-z, / or - without a digit
[0-9] : Match at least a single digit
[A-Za-z0-9/-]* : Optionally match what is listed in the character class, including the digit
Upvotes: 1