TLanni

Reputation: 340

Extract a string from a document using regex in Python

I need to extract a string from a document using a regex in Python. The string always starts with either "AK" or "BK", followed by numbers, letters, "-" or "/" in any order. The string can appear anywhere in the document.

document_text="""
This is the organization..this is the address. 
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

I have written the following code.

pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)

but the list I get also contains the letter-only AK and BK strings, something like this:

res_list=['AKBN','AK3418CPMP','BKCPU']

When I use re.search instead

res_grp=re.search(pattern,document_text)
res=res_grp.group()

I get only 'AKBN'.

With findall, it also matches "AKBN" and "BKCPU" along with the required "AK3418CPMP". I want the following conditions to hold, so that only the one string "AK3418CPMP" is extracted:

  1. string should start with AK or BK
  2. It should be followed by letters and numbers, or numbers and letters
  3. It can contain "-" or "/"

How can I extract only "AK3418CPMP"?

Upvotes: 1

Views: 451

Answers (3)

MarcZ

Reputation: 83

You can use "at least one" quantifiers, for example: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This covers your first and second requirements. You can extend the regex to handle the '-' and '/' cases too.

For example, to allow an optional '-' or '/' in front of each substring:

([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})
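
As a rough sketch of how these two patterns might be used from Python, assuming the document_text from the question: because both patterns contain capturing groups, re.findall would return tuples of groups rather than whole matches, so the sketch iterates with re.finditer and takes group(0). The helper extract_codes and the second sample string are made up for illustration.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# Letters then digits, or digits then letters, after the AK/BK prefix.
basic_pattern = r"([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,})"

# Same idea, with an optional '-' or '/' in front of each run.
separator_pattern = (
    r"([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})"
    r"|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})"
)


def extract_codes(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    # group(0) is the whole match, whichever alternative matched.
    return [m.group(0) for m in re.finditer(pattern, text)]


print(extract_codes(basic_pattern, document_text))
print(extract_codes(separator_pattern, "AKBN AK-3418/CPMP BK/CPU"))
```

With the question's sample text the first call should yield only 'AK3418CPMP', and the second only 'AK-3418/CPMP'. Note that these patterns accept uppercase letters only and have no word boundary, so they can also match inside a longer token.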

Upvotes: 0

mama

Reputation: 2227

You can keep your regex and let Python do the filtering.

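import re

document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

pattern="(?:AK|BK)[A-Za-z0-9-/]+"
# Keep only the matches that contain both a digit and a letter.
# The AK/BK prefix always satisfies the letter check, so the digit
# test is what actually filters out 'AKBN' and 'BKCPU'.
res_list=[x for x in
        re.findall(pattern,document_text)
        if re.search(r'\d', x)
        and re.search(r'[A-Za-z]', x)]

print(res_list)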

Upvotes: 1

The fourth bird

Reputation: 163642

You can require at least a single digit after matching AK or BK, and move the - to the end of the character class so that it is unambiguously a literal hyphen rather than a range operator.

\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
  • \b A word boundary to prevent a partial match
  • [AB]K Match either AK or BK
  • [A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
  • [0-9] Match at least a single digit
  • [A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
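
A minimal sketch of this pattern in Python, assuming the document_text from the question. Since the pattern has no capturing groups, re.findall returns the whole matches directly.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# The hyphen is last in each character class, so it is a literal '-'.
pattern = r"\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*"

# No capturing groups, so findall() returns the full matched strings.
res_list = re.findall(pattern, document_text)
print(res_list)  # ['AK3418CPMP']
```

The word boundary means a token such as XAK12 is skipped, while a token like AK-12/B is still matched as a whole.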


Upvotes: 1
