Reputation: 67
I am trying to extract some numeric values from a text, skipping the text that follows a matching keyword. For example:
Input Text -
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.
OR
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.
OR
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%.
Output Text -
Amount 400.00
GST 36479
GST 20%
The main point is that the input text can be in any format, but the output text should always be the same. What stays the same: the GST number is a non-decimal number, the GST percentage is a number followed by a "%" symbol, and the amount is in decimal form.
I tried, but I am not able to skip the non-numeric characters after GST. Please help.
What I tried:
pattern = re.compile(r"\b(?<=GST).\D(\d+)")
Upvotes: 1
Views: 225
Reputation: 627327
You can use
\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)
See the regex demo. Details:
- \bAmount\s* - a whole word Amount and zero or more whitespaces
- (?P<amount>\d+(?:\.\d+)?) - Group "amount": one or more digits and then an optional sequence of . and one or more digits
- .*? - any zero or more chars other than line break chars, as few as possible (whitespace included)
- \bGST - GST preceded by a word boundary
- \D* - zero or more chars other than digits
- (?P<gst_id>\d+(?:\.\d+)?) - Group "gst_id": one or more digits and then an optional sequence of . and one or more digits
- .*? - any zero or more chars other than line break chars, as few as possible
- \bGST\D* - GST preceded by a word boundary and then zero or more chars other than digits
- (?P<gst_prcnt>\d+(?:\.\d+)?%) - Group "gst_prcnt": one or more digits and then an optional sequence of . and one or more digits, and then a % char.
See the Python demo:
import re
pattern = r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
texts = ["ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.",
         "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.",
         "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%."]
for text in texts:
    m = re.search(pattern, text)
    if m:
        print(m.groupdict())
Output:
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
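If you need the exact three-line output from the question rather than a dictionary, the named groups can be formatted directly. This is a minimal sketch: the helper name format_invoice_line is hypothetical, and it assumes each input holds one Amount followed by the GST number and then the GST percentage, in that order.

```python
import re

# Same pattern as above, split across lines and compiled once for reuse.
PATTERN = re.compile(
    r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
)

def format_invoice_line(text):
    """Return the three requested output lines, or None when nothing matches.

    Hypothetical helper, assuming the order Amount, GST number, GST percentage.
    """
    m = PATTERN.search(text)
    if m is None:
        return None
    return "\n".join([
        f"Amount {m['amount']}",
        f"GST {m['gst_id']}",
        f"GST {m['gst_prcnt']}",
    ])

text = "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%."
print(format_invoice_line(text))
```

Returning None for a non-matching text lets the caller decide whether to skip the line or report it.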
Upvotes: 2