Reputation: 1640
I am working on a problem where I have multi-line strings taken from emails that contain a table-like snapshot.
Example below:
Hello,
please provide an update on the following invoice
Invoice# Status Invoice_Amount Account#
646464646 Open 7446.00 53334444
645543333 Open 6443.00 23599499
874646553 Open 6223.50 94744663
Thanks,
My task is to extract the invoice numbers, which in this case are 646464646, 645543333 and 874646553. After looking at a few examples, I know that they normally appear on the lines directly below a header such as Invoice# or Invoice Numbers.
I am trying to use regular expressions to solve this problem, but I have not been able to build a pattern that matches a keyword like "Invoice#" in the header and then extracts the numbers just below that header (the table snapshot can contain any number of rows).
My desired output from this example is:
[646464646,645543333,874646553]
I searched for an existing solution but did not find an example of matching text on the lines that follow a keyword. Please suggest an approach if you have an idea how to solve this.
Please let me know if further details are required. Thanks.
Edit: The example shown above is not a standard format; it is just one of the emails. Actual emails may lay out the snapshot differently: there could be more than 4 columns, with different headers and names, and the invoice number could have more or fewer than 9 digits. The only consistent thing, I believe, is the "Invoice#" keyword in the header.
Upvotes: 0
Views: 186
Reputation: 521259
Try first splitting your input string/file on Invoice#, then use re.findall on the second entry in the list:
import re
parts = email_text.split("Invoice#")  # email_text holds the email body; avoids shadowing the built-in input()
numbers = re.findall(r'(\d+) (?:Open|Closed)', parts[1])
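For reference, here is a minimal runnable sketch of the split-then-findall idea applied to the sample email from the question. The variable name `email_text` is only illustrative, and the pattern assumes the status column directly follows the invoice number and is either Open or Closed.

```python
import re

# Sample email body copied from the question.
email_text = """Hello,
please provide an update on the following invoice
Invoice# Status Invoice_Amount Account#
646464646 Open 7446.00 53334444
645543333 Open 6443.00 23599499
874646553 Open 6223.50 94744663
Thanks,
"""

# Split once on the header keyword; everything after it is the table body.
parts = email_text.split("Invoice#", 1)

# Capture each run of digits that is immediately followed by a status word.
numbers = re.findall(r"(\d+) (?:Open|Closed)", parts[1])
print(numbers)  # ['646464646', '645543333', '874646553']
```

The results are strings; wrap each one in `int()` if you need actual numbers.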
If you know for certain that all invoice numbers would always be 9 digits, then you may simplify the matching logic:
numbers = re.findall(r'\d{9}', parts[1])
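Since the edit to the question says the column layout and digit count can vary, a header-driven variant may be more forgiving. The sketch below is one possible approach, not a tested solution for every email format: it locates the header line containing the invoice keyword, works out which whitespace-separated column that keyword sits in, and then reads that column from the following lines until a line no longer has a number there. The function name `extract_invoice_numbers` and the default header pattern are my own assumptions, and it assumes the headers before the invoice column and the table cells contain no embedded spaces.

```python
import re

def extract_invoice_numbers(text, header_pattern=r"invoice\s*(?:#|numbers?|no\.?)"):
    """Return the invoice column of the first table found in text (hypothetical helper)."""
    lines = text.splitlines()
    header_re = re.compile(header_pattern, re.IGNORECASE)
    for i, line in enumerate(lines):
        match = header_re.search(line)
        if not match:
            continue
        # Column index = number of whitespace-separated headers before the keyword.
        col = len(line[:match.start()].split())
        numbers = []
        for row in lines[i + 1:]:
            cells = row.split()
            # Stop at the first line that has no all-digit value in that column.
            if len(cells) <= col or not cells[col].isdigit():
                break
            numbers.append(cells[col])
        if numbers:
            return numbers
    return []

sample = """Hello,
please provide an update on the following invoice
Invoice# Status Invoice_Amount Account#
646464646 Open 7446.00 53334444
645543333 Open 6443.00 23599499
874646553 Open 6223.50 94744663
Thanks,
"""
print(extract_invoice_numbers(sample))  # ['646464646', '645543333', '874646553']
```

Because it relies on column position rather than on the Open/Closed status or a fixed digit count, this version does not care how long the invoice numbers are or which columns surround them, as long as the assumptions above hold.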
Upvotes: 1