Pavan Deshmukh

Reputation: 1

How to read data from a bank statement PDF in Python?

I need to read the data from a bank statement PDF that contains both text and tables.

I have tried several solutions posted on Stack Overflow, but most of them raise errors.

Of those, the following code is the only one that ran, but it does not give the expected results.

from tika import parser

rawText = parser.from_file('icici.pdf')
rawList = rawText['content'].splitlines()
print(rawList)

I get this output:

2020-06-29 13:05:31,177 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Statement_MAY2020_013625568.pdf', '', '', '346001506028??PAVA0101 444501', '', '', '', '']

But I want the data from inside the PDF file, not data about the PDF file.

Can someone suggest a way to read the data from a bank statement PDF?

Upvotes: 0

Views: 4809

Answers (2)

Agafia

Reputation: 1

import re
import PyPDF2
import pandas as pd

# Open the PDF file
pdf_file = open('ВТБ_Выписка_по_счёту.pdf', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)

# Extract the text from each page
data = []
for page_num in range(len(pdf_reader.pages)):
    page = pdf_reader.pages[page_num]
    text = page.extract_text()

    # Parse the text to pull out the data (depends on how the text is laid out)
    # Example of a first-type VTB line: 24.12.2024
    pattern_date = re.compile(r'(\d{2}\.\d{2}\.\d{4})')
    # Example of a second-type VTB line: 13:14:4624.12.2024 -470.57 RUB 0.00 470.57 0.00 RUBОплата товаров и услуг. PYATEROCHKA 19704.
    pattern_data = re.compile(r'(\d{2}:\d{2}:\d{2})(\d{2}\.\d{2}\.\d{4}) (-?[\d,.]+) (RUB) ([\d,.]+) ([\d,.]+) ([\d,.]+ RUB)(.+)')
    lines = text.split('\n')
    date = None
    for line in lines:  # Process each line and append its fields to the data list
        match_date = pattern_date.match(line)
        if match_date:
            date, = match_date.groups()
        match_data = pattern_data.match(line)
        if match_data:
            time, process_date, amount, currency, income, expense, commission, description = match_data.groups()
            # VTB separates thousands with a comma and decimals with a point
            amount = amount.replace(',', '').replace('.', ',')
            income = income.replace(',', '').replace('.', ',')
            expense = expense.replace(',', '').replace('.', ',')
        else:
            time = process_date = amount = currency = income = expense = commission = description = None
        if date and time:
            data.append([page_num + 1, date, time, process_date, amount, currency, income, expense, commission, description])
            print(page_num, date, time, process_date, amount, currency, income, expense, commission, description)


# Close the file
pdf_file.close()
# Build a DataFrame from the collected rows
df = pd.DataFrame(data, columns=['Page', 'Transaction date', 'Transaction time', 'Processing date',
                                 'Amount', 'Currency', 'Income', 'Expense', 'Commission', 'Description'])
# Save the DataFrame as CSV
df.to_csv('ВТБ_Выписка_по_счёту.csv', index=False, encoding='utf-8-sig')
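
The transaction pattern can be checked without a PDF by running it against the sample line quoted in the comment above. This is only a sketch: `parse_line`, `to_decimal_comma` and the dictionary keys are names made up for the illustration, and it assumes every transaction line follows the VTB layout shown, so other banks will need their own pattern.

```python
import re

# Same pattern as in the answer: time and processing date are glued together,
# followed by amount, currency, income, expense, commission and description.
PATTERN_DATA = re.compile(
    r'(\d{2}:\d{2}:\d{2})(\d{2}\.\d{2}\.\d{4}) (-?[\d,.]+) (RUB) '
    r'([\d,.]+) ([\d,.]+) ([\d,.]+ RUB)(.+)'
)


def to_decimal_comma(number):
    """Drop thousands commas, then turn the decimal point into a comma."""
    return number.replace(',', '').replace('.', ',')


def parse_line(line):
    """Hypothetical helper: return the fields of one line, or None if no match."""
    match = PATTERN_DATA.match(line)
    if match is None:
        return None
    (time, process_date, amount, currency,
     income, expense, commission, description) = match.groups()
    return {
        'time': time,
        'process_date': process_date,
        'amount': to_decimal_comma(amount),
        'currency': currency,
        'income': to_decimal_comma(income),
        'expense': to_decimal_comma(expense),
        'commission': commission,
        'description': description,
    }


sample = ('13:14:4624.12.2024 -470.57 RUB 0.00 470.57 0.00 '
          'RUBОплата товаров и услуг. PYATEROCHKA 19704.')
print(parse_line(sample))        # a transaction line: prints a dict of fields
print(parse_line('24.12.2024'))  # a date-only line does not match this pattern
```

If `parse_line` returns None for lines that look like transactions, print `repr(line)` to see the exact spacing that `extract_text()` produced and adjust the pattern to it.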

Upvotes: 0

Nick

Reputation: 1

from tabula import read_pdf

# filepath is the path to the statement PDF
df_list = read_pdf(filepath, stream=True, guess=True, pages='all',
                   multiple_tables=True,
                   pandas_options={'header': None})

Try this. This code worked for me using the tabula-py module.

Upvotes: 0
