M_Arora
M_Arora

Reputation: 133

Search for a partial string in text and extract numbers from lines above and below the matched pattern

I am trying to write a regular expression that will search based on a string and if it founds even a partial match. I can get extract numbers from lines (2 lines) above and below the matched string or substring.

My text is:

Subtotal AED1,232.20
AED61.61
VAT
5 % Tax:
RECEIPT TOTAL: AED1.293.81

I wish to search for the word VAT and extract all numbers from two lines above and below it.

Expected output:

AED1,232.20
AED61.61
5 % 
AED1.293.81

I am able to extract the entire content but I need the numbers, AED can be dropped or ignored.

My regex is:

((.*\n){2}).*vat(.*\n.*\n.*)

Thanks in advance!

Upvotes: 1

Views: 138

Answers (3)

SeaBean
SeaBean

Reputation: 23217

This regex is tailor-made for your input text and expected output:

r'.* (AED\d{1,3}(?:,\d{3})*\.\d{2})\n(AED\d{1,3}(?:,\d{3})*\.\d{2})\nVAT\n(\d{1,2} %) Tax:\n.* (AED\d{1,3}(?:,\d{3})*\.\d{2})'

Your required Regex

It outputs exactly the text you want, without extra words.

It also works with more than one "VAT" in your input text.

Regex Logics:

  • (AED\d{1,3}(?:,\d{3})*\.\d{2}) Match currency code and amount (in one group)
  • (\d{1,2} %) Match VAT %. Supports 1 to 2 digits. You can further enhance it to support decimal point.

Note that the proper regex for currency amount (with comma as thousand separator and exactly 2 decimal points) should be as follows:

r'\d{1,3}(?:,\d{3})*\.\d{2}'

[with (?: expr) to indicate nontagged group so that this subgroup will not be tagged as a match for your re.findall function call.]

In case your input supports currency codes other than 'AED', you can replace 'AED' with [A-Z]{3} as currency code should normally be in 3-character capital letters.

Upvotes: 0

Leonardo Scotti
Leonardo Scotti

Reputation: 1080

try this:

(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\nVAT\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*).*\n[^0-9]*(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)

This regex can seem too complex or long, but it has better control and returns only numbers, it will be his work.

Regex Demo

Upvotes: 3

anubhava
anubhava

Reputation: 784998

You may use this regex in python:

((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})

RegEx Demo

RegEx Details:

  • ((?:^.*\d.*\n){0,2}): Match 2 leading lines that must contain at least a digit
  • VAT: match text VAT
  • ((?:\n.*\d.*){0,2}): Match 2 trailing lines that must contain at least a digit

Upvotes: 2

Related Questions