ManojK

Reputation: 1640

How to extract text from the lines following a specific keyword in Python?

I am working on a problem where I have multi-line strings taken from emails that contain a table-like snapshot.

Example below:

Hello,

please provide an update on the following invoice

Invoice#        Status    Invoice_Amount        Account#
646464646       Open      7446.00               53334444
645543333       Open      6443.00               23599499
874646553       Open      6223.50               94744663

Thanks,

My task is to extract the invoice numbers, which in this case are 646464646, 645543333 and 874646553. After looking at a few examples, I know that they normally appear on the lines following a header such as Invoice# or Invoice Numbers.

I am trying to use regular expressions to solve this, but I have not been able to build a pattern that matches a keyword like "Invoice#" in the header and extracts the numbers just below it (the table snapshot can have any number of rows).

My desired output from this example is:

[646464646,645543333,874646553]

I searched for an existing solution but did not find any example of matching text on the lines after a keyword. Please suggest an approach if you have an idea how to solve this.

Please let me know if further details are required. Thanks.

Edit: The example above is not a standard format; it is just one of the emails. Actual emails may lay out this snapshot differently: there could be more than 4 columns with different header names, and the invoice number could have more or fewer than 9 digits. The only consistent thing, I believe, is the "Invoice#" keyword in the header.

Upvotes: 0

Views: 186

Answers (1)

Tim Biegeleisen

Reputation: 521259

Try first splitting your input string/file on Invoice#, then use re.findall on the second entry in the list:

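import re

parts = text.split("Invoice#")  # text holds the email body; avoids shadowing the built-in input()
numbers = re.findall(r'(\d+)\s+(?:Open|Closed)', parts[1])  # \s+ tolerates any column spacing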
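As a rough end-to-end illustration of the split-then-findall idea, here is a sketch run against the sample email from the question. The variable names (`email`, `after_header`) are made up for this example, and it assumes the status column only ever contains Open or Closed. It uses `str.partition` instead of `split` so that an email without the "Invoice#" keyword yields an empty list rather than an IndexError.

```python
import re

email = """Hello,

please provide an update on the following invoice

Invoice#        Status    Invoice_Amount        Account#
646464646       Open      7446.00               53334444
645543333       Open      6443.00               23599499
874646553       Open      6223.50               94744663

Thanks,
"""

# partition returns (before, separator, after); `after` is "" when the
# keyword is missing, so a header-less email simply produces no matches.
_, _, after_header = email.partition("Invoice#")

# Capture each run of digits that is followed by whitespace and a status word
# (assumption: the status is always "Open" or "Closed").
numbers = [int(n) for n in re.findall(r"(\d+)\s+(?:Open|Closed)", after_header)]
print(numbers)
```

Converting with `int` matches the desired output in the question; keep the matches as strings if invoice numbers can have leading zeros.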

If you know for certain that all invoice numbers would always be 9 digits, then you may simplify the matching logic:

numbers = re.findall(r'\d{9}', parts[1])
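For the less regular layouts described in the question's edit (extra columns, varying digit counts, no fixed status values), a header-driven variant might be more robust. The following is only a sketch: it assumes columns are separated by whitespace, no cell contains a space, and the table ends at the first blank or non-numeric row. `extract_invoice_numbers` and its `keyword` parameter are hypothetical names introduced for this example.

```python
import re

def extract_invoice_numbers(text, keyword="Invoice#"):
    """Return the integers found under the `keyword` column of a whitespace-separated table."""
    lines = text.splitlines()

    # Locate the header line and the position of the keyword within it.
    for i, line in enumerate(lines):
        headers = line.split()
        if keyword in headers:
            col = headers.index(keyword)
            break
    else:
        return []  # no line contains the keyword as a column header

    numbers = []
    for row in lines[i + 1:]:
        cells = row.split()
        # A blank or too-short row, or a non-numeric cell, marks the end of the table.
        if len(cells) <= col or not re.fullmatch(r"\d+", cells[col]):
            break
        numbers.append(int(cells[col]))
    return numbers


sample = """Invoice#        Status    Invoice_Amount        Account#
646464646       Open      7446.00               53334444
645543333       Open      6443.00               23599499
874646553       Open      6223.50               94744663

Thanks,
"""
print(extract_invoice_numbers(sample))
```

Because the column is found by its position in the header, this does not depend on the number of digits or on the invoice column being first. It would need adjusting if a cell can contain spaces, since that shifts the column positions.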

Upvotes: 1
