Soyam Maharathy

Reputation: 73

Extracting information from text using Python regex

I have a text:

text = 'dear customer your account xx9052 has been debited with inr 25697.50 on 23-nov-18 info bil 001582495861 icici bank the available balance is inr 363.25'

Here, I am trying to extract information like account number, amount, date and available balance from the text.

I tried the following regex:

import re

pattern = r'your account (.*) has been debited with (.*) on (.*) info (.*) available balance is (.*\d)$'

# Search once and reuse the match object instead of re-running the regex per group
match = re.search(pattern, text, re.IGNORECASE)
if match:
    print(match.group(1))  # account number
    print(match.group(2))  # debited amount
    print(match.group(3))  # date
    print(match.group(5))  # available balance

I got the desired results:

xx9052
inr 25697.50
23-nov-18
inr 363.25

But this pattern breaks when the text is slightly modified:

text = 'dear customer your account xx9052 has been debited with inr 25697.50 on 23-nov-18 info bil 001582495861 icici bank the available balance is inr 363.25 for dispute call 04033667777'

Using the same regex gives me result:

xx9052
inr 25697.50
23-nov-18
inr 363.25 for dispute call 04033667777

The balance is extracted with extra information, while it should be only inr 363.25. How can I resolve this so that the information is extracted correctly in both cases using a single pattern?

Upvotes: 2

Views: 790

Answers (3)

Ragu Natarajan

Reputation: 739

Input text

text = 'dear customer your account xx9052 has been debited with inr 25697.50 on 23-nov-18 info bil 001582495861 icici bank the available balance is inr 363.25 for dispute call 04033667777'

Using the below regex:

pattern = r'your account (.*) has been debited with (.*) on (.*) info bil (.*) icici bank the available balance is (.*[\d]+\.[\d]+)'

Output:

xx9052
inr 25697.50
23-nov-18
001582495861
inr 363.25

Upvotes: 1

Anicet Rakotonirina

Reputation: 66

The pattern:

(.*\d)$

matches everything up to the last digit at the end of the string, so in this case it captures the text all the way through the phone number at the end. If possible, make the pattern more specific, for example by including the "inr" prefix. Alternatively, extract all the numbers separately, for example with:

re.findall(r'\d*\.?\d+', text)

which returns a list of every number in the text, with or without a decimal point.

source: https://www.tutorialspoint.com/Extract-decimal-numbers-from-a-string-in-Python
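
Here is a minimal sketch of both suggestions, run against the modified text from the question. The `inr`-prefixed pattern is only one possible way to "include the inr". It is my assumption, not part of the linked source.

```python
import re

text = ('dear customer your account xx9052 has been debited with inr 25697.50 '
        'on 23-nov-18 info bil 001582495861 icici bank the available balance '
        'is inr 363.25 for dispute call 04033667777')

# Every number in the text, with or without a decimal point.
numbers = re.findall(r'\d*\.?\d+', text)
print(numbers)
# ['9052', '25697.50', '23', '18', '001582495861', '363.25', '04033667777']

# Assumed, more specific alternative: only amounts that follow "inr".
amounts = re.findall(r'inr\s*\d+(?:\.\d+)?', text)
print(amounts)
# ['inr 25697.50', 'inr 363.25']
```

Note that the first call also picks up the date parts and the phone number, so you would still have to decide which entries are amounts. The `inr`-prefixed version avoids that at the cost of assuming the currency code is always present.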

Upvotes: 2

stud3nt

Reputation: 2153

I'd suggest extracting each piece of information separately instead of using a single pattern.

For example, to fetch the amount you can use the regex pattern ([\d]+\.[\d]+). It matches decimal numbers in the string, and you can write similar patterns for the other fields, such as the account number and the date.
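
As a rough sketch of the per-field approach: the `FIELD_PATTERNS` dictionary and the `extract_fields` helper below are hypothetical names, and each pattern assumes the message wording shown in the question (for instance, that amounts are prefixed with "inr" and dates look like 23-nov-18).

```python
import re

text = ('dear customer your account xx9052 has been debited with inr 25697.50 '
        'on 23-nov-18 info bil 001582495861 icici bank the available balance '
        'is inr 363.25 for dispute call 04033667777')

# Hypothetical per-field patterns; each one captures a single value.
FIELD_PATTERNS = {
    'account': r'account\s+(\w+)',
    'amount': r'debited with\s+(inr\s*\d+(?:\.\d+)?)',
    'date': r'on\s+(\d{1,2}-[a-z]{3}-\d{2,4})',
    'balance': r'available balance is\s+(inr\s*\d+(?:\.\d+)?)',
}

def extract_fields(message):
    """Return {field name: captured text}, with None for fields not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, message, re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result

fields = extract_fields(text)
print(fields)
```

Because each field has its own pattern, trailing text such as the dispute phone number cannot leak into the balance, and a message that lacks one field still yields the others.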

Update:
If you want to keep the same template, change your regex to:

pattern = r'your account (.*) has been debited with (.*) on (.*) info (.*) available balance is (.*[\d]+\.[\d]+)'
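
A quick check of this pattern against both messages from the question (the variable names `short_text`, `long_text` and `results` are mine, chosen for illustration):

```python
import re

pattern = (r'your account (.*) has been debited with (.*) on (.*) info (.*) '
           r'available balance is (.*[\d]+\.[\d]+)')

short_text = ('dear customer your account xx9052 has been debited with '
              'inr 25697.50 on 23-nov-18 info bil 001582495861 icici bank '
              'the available balance is inr 363.25')
long_text = short_text + ' for dispute call 04033667777'

results = []
for text in (short_text, long_text):
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        # Group 4 (the "info" part) is captured but not needed here.
        account, amount, date, _info, balance = match.groups()
        results.append((account, amount, date, balance))

for row in results:
    print(row)
# ('xx9052', 'inr 25697.50', '23-nov-18', 'inr 363.25') for both messages
```

One caveat: the last group is still greedy, so if the trailing text ever contained another decimal number, the balance would stretch up to it. A tighter final group such as `(inr\s*\d+\.\d+)` would avoid that, assuming the balance always starts with "inr".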

Upvotes: 3
