Reputation: 111
I have a list named Statement, created from a PDF using pytesseract and regex:
Statement = ['07-10-2019 UPI/927912685773/UPI/surya.balaji94@/Citibank 6,677.00 2,36,804.08',
             '07-10-2019 MOBILE BANKING DUT/CITIBANK 3,403.00 2,40,207.08',
             '07-10-2019 BIL/INFT/001818195728/82D3/ AJAY KUMAR JHA 6,080.00 2,46, 287.08',
             '08.10.2019 MOBILE BANKING MMM TiMPS/928115182374/8161 Oct Mte/AMARJEET SIHDFC 4,411.00 250,698.08',
             '08-10-2019 BIL/INFT/001818636132/E3 BIk1 Pramod/ PRAMOD KUMAR P 6,599.00 2,57,297.08']
With some help from Stack Overflow, I created a list of dictionaries as follows:
import re
from collections import namedtuple

cols = ["Date", "Item_Name", "Transaction_Amount", "Balance"]
date_pattern = re.compile(r"\d{2}[- /.]\d{2}[- /.]\d{4}", re.I)
item_and_name_pattern = re.compile(r"(?<=\d{2}-\d{2}-\d{4}\s).*", re.I)
amount_pattern = re.compile(r"\d+,\d+.\d+", re.I)
total_pattern = re.compile(r"\d+,\d+,\d+.\d+$", re.I)
Transaction = namedtuple("Transaction", cols)
transactions = []
for item in Statement:
    try:
        date = re.search(date_pattern, item).group()
        total = re.search(total_pattern, item).group()
        temp_1 = item.rstrip(total)
        amount = re.search(amount_pattern, item).group()
        temp_2 = temp_1.strip().rstrip(amount)
        item_and_name = re.search(item_and_name_pattern, temp_2).group()
    except:
        pass
    t = Transaction(date, item_and_name, amount, total)
    transactions.append(t)
out = [{k: v for k, v in f._asdict().items()} for f in transactions]
But the output is not satisfactory: it picks up the correct date, but the item name, transaction amount and balance for that date are wrong in some rows (compare the list above with the dictionaries below). Is there another way to store these values correctly in named columns?
[{'Date': '07-10-2019',
'Item_Name': 'UPI/927912685773/UPI/surya.balaji94@/Citibank ',
'Transaction_Amount': '6,677.00',
'Balance': '2,36,804.08'},
{'Date': '07-10-2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '07-10-2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '08.10.2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '08-10-2019',
'Item_Name': 'BIL/INFT/001818636132/E3 BIk1 Pramod/ PRAMOD KUMAR P ',
'Transaction_Amount': '6,599.00',
'Balance': '2,57,297.08'}]
Upvotes: 1
Views: 167
Reputation: 1266
Here is a simpler way:
import re
pattern = re.compile(r"(?P<Date>\d{2}[.-]\d{2}[.-]\d{4})\s(?P<Item_Name>.+)\s(?P<Transaction_Amount>[0-9,\.]+)\s(?P<Balance>[0-9,\.]+)")
print([pattern.match(item).groupdict() for item in Statement])
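As for why the loop in the question repeats rows: it looks like total_pattern insists on two commas and no spaces, so it finds nothing in lines ending in "2,46, 287.08" or "250,698.08". Calling .group() on None then raises AttributeError, the bare except hides it, and the values left over from the previous line are appended under the new date. A minimal sketch of that effect, using three of the lines from the question (the names lines, last_total and seen exist only for this illustration):

```python
import re

# total_pattern exactly as written in the question.
total_pattern = re.compile(r"\d+,\d+,\d+.\d+$", re.I)

# Three of the OCR lines from the question.
lines = [
    '07-10-2019 MOBILE BANKING DUT/CITIBANK 3,403.00 2,40,207.08',
    '07-10-2019 BIL/INFT/001818195728/82D3/ AJAY KUMAR JHA 6,080.00 2,46, 287.08',
    '08.10.2019 MOBILE BANKING MMM TiMPS/928115182374/8161 Oct Mte/AMARJEET SIHDFC 4,411.00 250,698.08',
]

last_total = None  # illustrative stand-in for the question's `total` variable
seen = []
for line in lines:
    try:
        last_total = total_pattern.search(line).group()
    except AttributeError:
        # search() returned None, so last_total still holds the previous line's value.
        pass
    seen.append(last_total)

print(seen)  # ['2,40,207.08', '2,40,207.08', '2,40,207.08']
```

Only the first line yields a fresh balance; the other two silently reuse it, which matches the repeated rows in the question's output.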
EDIT: If using try-except as requested in the comments:
result = []
for item in Statement:
    try:
        result.append(pattern.match(item).groupdict())
    except AttributeError:
        pass
print(result)
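One caveat: OCR noise such as the space in "2,46, 287.08" appears to make the pattern above split that line in the wrong place (the amount becomes "2,46," and the balance "287.08"), and the try-except version drops non-matching lines without telling you. Below is a possible stricter variant, offered as a sketch rather than a definitive fix. It assumes every amount ends in exactly two decimals and that the only OCR damage is a stray space after a comma. AMOUNT, strict_pattern, parse_statement and the 'Page 1 of 3' line are hypothetical names and data introduced for this example.

```python
import re

# Sample lines from the question, plus one hypothetical non-transaction line.
Statement = [
    '07-10-2019 UPI/927912685773/UPI/surya.balaji94@/Citibank 6,677.00 2,36,804.08',
    '07-10-2019 BIL/INFT/001818195728/82D3/ AJAY KUMAR JHA 6,080.00 2,46, 287.08',
    '08.10.2019 MOBILE BANKING MMM TiMPS/928115182374/8161 Oct Mte/AMARJEET SIHDFC 4,411.00 250,698.08',
    'Page 1 of 3',  # hypothetical junk line, not from the question
]

# Assumption: digit groups separated by commas (optionally followed by one
# stray whitespace character from OCR), then a dot and exactly two decimals.
AMOUNT = r"\d+(?:,\s?\d+)*\.\d{2}"

strict_pattern = re.compile(
    r"(?P<Date>\d{2}[.-]\d{2}[.-]\d{4})\s+"
    r"(?P<Item_Name>.+?)\s+"
    r"(?P<Transaction_Amount>" + AMOUNT + r")\s+"
    r"(?P<Balance>" + AMOUNT + r")\s*$"
)


def parse_statement(lines):
    """Return (rows, unmatched): parsed dicts and the lines that did not match."""
    rows, unmatched = [], []
    for line in lines:
        m = strict_pattern.match(line)
        if m is None:
            unmatched.append(line)  # keep it for inspection instead of dropping it
            continue
        row = m.groupdict()
        for key in ("Transaction_Amount", "Balance"):
            row[key] = re.sub(r"\s", "", row[key])  # remove OCR spaces inside numbers
        rows.append(row)
    return rows, unmatched


rows, unmatched = parse_statement(Statement)
print(rows)
print(unmatched)
```

Returning the unmatched lines separately makes it easier to spot OCR cases the pattern does not cover yet, instead of losing them inside a try-except.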
Upvotes: 1