Reputation: 133
I am trying to write a regular expression that searches for a string and, if it finds even a partial match, extracts the numbers from the two lines above and the two lines below the matched string or substring.
My text is:
Subtotal AED1,232.20
AED61.61
VAT
5 % Tax:
RECEIPT TOTAL: AED1.293.81
I wish to search for the word VAT and extract all numbers from the two lines above and below it.
Expected output:
AED1,232.20
AED61.61
5 %
AED1.293.81
I am able to extract the entire content, but I need only the numbers; AED can be dropped or ignored.
My regex is:
((.*\n){2}).*vat(.*\n.*\n.*)
Thanks in advance!
Upvotes: 1
Views: 138
Reputation: 23217
This regex is tailor-made for your input text and expected output:
r'.* (AED\d{1,3}(?:,\d{3})*\.\d{2})\n(AED\d{1,3}(?:,\d{3})*\.\d{2})\nVAT\n(\d{1,2} %) Tax:\n.* (AED\d{1,3}(?:,\d{3})*\.\d{2})'
It outputs exactly the text you want, without extra words.
It also works with more than one "VAT" in your input text.
(AED\d{1,3}(?:,\d{3})*\.\d{2})
Match currency code and amount (in one group)
(\d{1,2} %)
Match VAT %. Supports 1 to 2 digits. You can further enhance it to support a decimal point.
Note that the proper regex for a currency amount (with comma as thousands separator and exactly 2 decimal places) should be as follows:
r'\d{1,3}(?:,\d{3})*\.\d{2}'
[with (?: expr) marking a non-capturing group, so that this subgroup is not returned as a separate match by your re.findall function call.]
In case your input supports currency codes other than 'AED', you can replace 'AED' with [A-Z]{3}, since currency codes are normally three capital letters.
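As a rough sketch of how the pattern above might be used with re.findall (the variable names are illustrative, and the pattern is only split into pieces for readability): note that the total line in the question is written with a dot, AED1.293.81, so the strict comma-based amount pattern captures only AED1.29 there. The second sample assumes the total was meant to read AED1,293.81.

```python
import re

# One capturing group for "AED" plus an amount with comma thousands separators.
AMOUNT = r"(AED\d{1,3}(?:,\d{3})*\.\d{2})"

# Same pattern as in the answer, assembled from pieces for readability.
PATTERN = re.compile(
    r".* " + AMOUNT + r"\n" + AMOUNT + r"\nVAT\n(\d{1,2} %) Tax:\n.* " + AMOUNT
)

sample_as_posted = (
    "Subtotal AED1,232.20\n"
    "AED61.61\n"
    "VAT\n"
    "5 % Tax:\n"
    "RECEIPT TOTAL: AED1.293.81"
)

# Assumption: the total was meant to use a comma as the thousands separator.
sample_with_comma = sample_as_posted.replace("AED1.293.81", "AED1,293.81")

# findall returns one tuple of groups per VAT block found in the text.
posted_result = PATTERN.findall(sample_as_posted)
comma_result = PATTERN.findall(sample_with_comma)

print(posted_result)  # last amount is cut short at "AED1.29"
print(comma_result)   # last amount is captured in full
```

If dotted thousands separators really do occur in the input, replacing the `,` inside the amount pattern with `[.,]` would be one way to accept both.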
Upvotes: 0
Reputation: 1080
try this:
(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\nVAT\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*).*\n[^0-9]*(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)
This regex may seem complex or long, but it gives finer control and returns only the numbers, so it should do the job.
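A minimal sketch of running this pattern against the sample text from the question; the pattern is the one above, only split across several string literals, and the variable names are illustrative.

```python
import re

# The pattern from this answer; adjacent raw strings are concatenated by Python.
PATTERN = re.compile(
    r"(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n"      # 2nd line above VAT
    r"(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n"      # 1st line above VAT
    r"VAT\n"
    r"(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*).*\n"    # 1st line below VAT
    r"[^0-9]*(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)" # 2nd line below VAT
)

text = (
    "Subtotal AED1,232.20\n"
    "AED61.61\n"
    "VAT\n"
    "5 % Tax:\n"
    "RECEIPT TOTAL: AED1.293.81"
)

match = PATTERN.search(text)
# Each capturing group holds only digits, commas and dots.
numbers = list(match.groups()) if match else []
print(numbers)
```

Because the groups accept any run of digits, commas and dots, the dotted total AED1.293.81 is returned as 1.293.81 without validation.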
Upvotes: 3
Reputation: 784998
You may use this regex in Python:
((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})
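A possible way to apply this pattern from Python, as a sketch only: it assumes the re.MULTILINE flag, so that ^ matches at the start of every line (without it, the leading group comes back empty for the sample text). The NUMBER pattern and the numbers_around_vat helper are hypothetical additions used to pull the numbers out of the two captured blocks.

```python
import re

# The answer's pattern; MULTILINE lets ^ match at the start of each line.
PATTERN = re.compile(r"((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})", re.MULTILINE)

# Hypothetical helper pattern: a digit followed by digits, commas or dots.
NUMBER = re.compile(r"\d[\d,.]*")


def numbers_around_vat(text):
    """Return the numbers found in up to two lines above and below 'VAT'."""
    match = PATTERN.search(text)
    if match is None:
        return []
    before, after = match.groups()
    return NUMBER.findall(before) + NUMBER.findall(after)


text = (
    "Subtotal AED1,232.20\n"
    "AED61.61\n"
    "VAT\n"
    "5 % Tax:\n"
    "RECEIPT TOTAL: AED1.293.81"
)

result = numbers_around_vat(text)
print(result)
```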
RegEx Details:
((?:^.*\d.*\n){0,2})
: Match up to 2 leading lines, each of which must contain at least one digit
VAT
: Match the text VAT
((?:\n.*\d.*){0,2})
: Match up to 2 trailing lines, each of which must contain at least one digit
Upvotes: 2