Havishaa Sharma

Reputation: 67

skipping a match in regex

I am trying to extract some numeric values from a text, skipping over parts of it based on matching text. For example:

      Input Text - 
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.
      OR
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.
      OR
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%.

      Output Text -
      Amount 400.00
      GST 36479
      GST 20%

The main point is that the input text can be in any format, but the output text should always be the same. What stays constant is that the GST number is a non-decimal number, the GST percentage is a number followed by a "%" symbol, and the amount is in decimal form.

I tried, but I am not able to skip the non-numeric characters after GST. Please help.

What I tried :

              pattern = re.compile(r"\b(?<=GST).\D(\d+)") 

Upvotes: 1

Views: 225

Answers (1)

Wiktor Stribiżew

Reputation: 627327

You can use

\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)

See the regex demo. Details:

  • \bAmount\s* - a whole word Amount and zero or more whitespaces
  • (?P<amount>\d+(?:\.\d+)?) - Group "amount": one or more digits and then an optional sequence of . and one or more digits
  • .*? - any zero or more chars other than line break chars, as few as possible
  • \bGST - a word GST
  • \D* - zero or more chars other than digits
  • (?P<gst_id>\d+(?:\.\d+)?) - Group "gst_id": one or more digits and then an optional sequence of . and one or more digits
  • .*? - any zero or more chars other than line break chars, as few as possible
  • \bGST\D* - a word GST and then zero or more chars other than digits
  • (?P<gst_prcnt>\d+(?:\.\d+)?%) - Group "gst_prcnt": one or more digits and then an optional sequence of . and one or more digits, and then a % char.
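The \D* part is what does the skipping. As a minimal sketch, the snippet below compares the pattern from the question with a simplified GST\D*(\d+) pattern on shortened versions of the sample inputs. The names attempt, skip_non_digits and samples are made up for this illustration.

```python
import re

# Pattern from the question: exactly one arbitrary char and one non-digit char
# must sit between "GST" and the digits.
attempt = re.compile(r"\b(?<=GST).\D(\d+)")

# Simplified alternative (illustrative only): skip any run of non-digit chars.
skip_non_digits = re.compile(r"\bGST\D*(\d+)")

samples = [
    "with GST# 36479 GST percentage is 20%.",
    "with GST Reg No. 36479 GST% is 20%.",
]

for s in samples:
    a = attempt.search(s)
    b = skip_non_digits.search(s)
    # Print the captured number for each pattern, or None when there is no match.
    print(a.group(1) if a else None, b.group(1) if b else None)
```

The original attempt only works for "GST# 36479", where exactly two non-digit chars separate GST from the number. With "GST Reg No. 36479" it finds nothing, while \D* consumes the whole " Reg No. " run.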

See the Python demo:

import re
pattern = r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"

texts = ["ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%."]

for text in texts:
    m = re.search(pattern, text)
    if m:
        print(m.groupdict())

Output:

{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
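If you need the exact output layout from the question rather than a dictionary, the named groups can be formatted afterwards. This is only a sketch: format_fields is a hypothetical helper name, and returning None for non-matching text is an assumption about how you want to handle that case.

```python
import re

# Same regex as above, split across lines for readability.
pattern = re.compile(
    r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
)

def format_fields(text):
    """Return the three output lines, or None when the text does not match."""
    m = pattern.search(text)
    if m is None:
        return None
    return "\n".join([
        f"Amount {m['amount']}",
        f"GST {m['gst_id']}",
        f"GST {m['gst_prcnt']}",
    ])

text = "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%."
print(format_fields(text))
```

For the sample text this prints "Amount 400.00", "GST 36479" and "GST 20%" on separate lines.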

Upvotes: 2
