Reputation: 1
I need to read data from a bank statement PDF that contains both text and tables.
I have tried several solutions from Stack Overflow, but most of them raise errors.
Of those, the following code runs, but it does not give the expected results:
from tika import parser
rawText = parser.from_file('icici.pdf')
rawList = rawText['content'].splitlines()
print(rawList)
The output is:
2020-06-29 13:05:31,177 [MainThread ] [WARNI] Failed to see startup log message; retrying...
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Statement_MAY2020_013625568.pdf', '', '', '346001506028??PAVA0101 444501', '', '', '', '']
But I want the data inside the PDF file, not information about the PDF file.
Can someone suggest a way to read the data from a bank statement PDF?
Upvotes: 0
Views: 4809
Reputation: 1
import re
import PyPDF2
import pandas as pd

# Example of the first kind of VTB line (operation date): 24.12.2024
pattern_date = re.compile(r'(\d{2}\.\d{2}\.\d{4})')
# Example of the second kind of VTB line: 13:14:4624.12.2024 -470.57 RUB 0.00 470.57 0.00 RUBОплата товаров и услуг. PYATEROCHKA 19704.
pattern_data = re.compile(r'(\d{2}:\d{2}:\d{2})(\d{2}\.\d{2}\.\d{4}) (-?[\d,.]+) (RUB) ([\d,.]+) ([\d,.]+) ([\d,.]+ RUB)(.+)')

data = []
# Open the PDF file; the with block closes it automatically
with open('ВТБ_Выписка_по_счёту.pdf', 'rb') as pdf_file:
    # PdfReader and .pages replace PdfFileReader, numPages and getPage,
    # which were removed in PyPDF2 3.0
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Extract the text of each page
    for page_num, page in enumerate(pdf_reader.pages):
        text = page.extract_text()
        # Process the text to extract the data (depends on the structure of the text)
        lines = text.split('\n')
        date = None
        for line in lines:  # Process the line and append the data to the list
            match_date = pattern_date.match(line)
            if match_date:
                date = match_date.group(1)
            match_data = pattern_data.match(line)
            if match_data:
                time, process_date, amount, currency, income, expense, commission, description = match_data.groups()
                # VTB separates thousands with a comma and decimals with a period;
                # drop the thousands separator and switch to a decimal comma
                amount = amount.replace(',', '').replace('.', ',')
                income = income.replace(',', '').replace('.', ',')
                expense = expense.replace(',', '').replace('.', ',')
            else:
                time = process_date = amount = currency = income = expense = commission = description = None
            if date and time:
                data.append([page_num + 1, date, time, process_date, amount, currency, income, expense, commission, description])
                print(page_num, date, time, process_date, amount, currency, income, expense, commission, description)

# Build a DataFrame from the data
df = pd.DataFrame(data, columns=['Page', 'Operation date', 'Operation time', 'Processing date',
                                 'Operation amount', 'Currency', 'Income', 'Expense', 'Commission', 'Operation description'])
# Save the DataFrame to CSV
df.to_csv('ВТБ_Выписка_по_счёту.csv', index=False, encoding='utf-8-sig')
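To check the transaction pattern without a PDF at hand, it can be run against a single line of text. The following is only a sketch: the sample line is made up (modeled on the VTB example in the comment above, with a thousands separator added), `parse_line` and `to_decimal` are hypothetical helper names, and it converts amounts to `Decimal` instead of the decimal-comma strings used in the answer.

```python
import re
from decimal import Decimal

# Same pattern as in the answer: the time and the processing date are glued together.
PATTERN_DATA = re.compile(
    r'(\d{2}:\d{2}:\d{2})(\d{2}\.\d{2}\.\d{4}) (-?[\d,.]+) (RUB) '
    r'([\d,.]+) ([\d,.]+) ([\d,.]+ RUB)(.+)'
)


def to_decimal(value):
    """Convert a string such as '1,234.56' (comma = thousands separator) to Decimal."""
    return Decimal(value.replace(',', ''))


def parse_line(line):
    """Return a dict of transaction fields, or None if the line is not a transaction."""
    match = PATTERN_DATA.match(line)
    if match is None:
        return None
    time, process_date, amount, currency, income, expense, commission, description = match.groups()
    return {
        'time': time,
        'process_date': process_date,
        'amount': to_decimal(amount),
        'currency': currency,
        'income': to_decimal(income),
        'expense': to_decimal(expense),
        # The commission group includes the trailing ' RUB'; strip it before converting.
        'commission': to_decimal(commission.replace(' RUB', '')),
        'description': description,
    }


# Made-up sample line for illustration only.
sample = '13:14:4624.12.2024 -1,470.57 RUB 0.00 1,470.57 0.00 RUBPayment for goods. PYATEROCHKA 19704.'
row = parse_line(sample)
print(row)
```

If `parse_line` returns `None` for lines that clearly are transactions, the text extracted from your statement probably has a different layout and the pattern needs adjusting.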
Upvotes: 0
Reputation: 1
from tabula import read_pdf

# filepath is the path to your PDF statement
df_list = read_pdf(filepath, stream=True, guess=True, pages='all',
                   multiple_tables=True,
                   pandas_options={'header': None})
Try this. It worked for me using the tabula-py module.
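With `multiple_tables=True`, `read_pdf` returns a list of DataFrames, one per detected table. A minimal sketch of merging them into a single table, assuming the tables share the same column layout; `combine_tables` is a hypothetical helper name, and the two small DataFrames below are made-up stand-ins for real `read_pdf` output:

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one and drop rows that are entirely empty."""
    if not tables:
        return pd.DataFrame()
    combined = pd.concat(tables, ignore_index=True)
    return combined.dropna(how='all').reset_index(drop=True)


# Made-up stand-ins for the per-page tables; with real data you would pass df_list.
page1 = pd.DataFrame([['01.05.2020', '100.00'], [None, None]])
page2 = pd.DataFrame([['02.05.2020', '50.00']])
combined = combine_tables([page1, page2])
print(combined)
```

Tables with different numbers of columns will still concatenate, but the missing cells are filled with NaN, so it is worth inspecting each DataFrame before merging.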
Upvotes: 0