Reputation: 95
I have extracted the text from a Credit Card Statement, sliced for only the transactions within the entire month, and saved that within a list. The list is 404 strings long, and an excerpt of that list is shown below:
['Apr',
'27',
'UBER',
'TECHNOLOGIES',
'INC866-576-1039CA',
'$10.93',
'Apr',
'27',
'UBER',
'TECHNOLOGIES',
'INC866-576-1039CA',
'$11.38',
'Apr',
'28',
'COSTCO',
'WHSE',
'#1018HOUSTONTX',
'$105.02']
As you can see, the list follows a standard format where one transaction is comprised of 3 pieces: the Date (2 separate strings), the Description (2-5 strings), and the Amount (1 string).
My goal is to create a Pandas Dataframe based upon this information, but I am having trouble figuring out a way to manipulate the strings to properly segment each category (Date, Description, Amount). I believe I need to use some combination of a Regex and a loop to accomplish this.
Per comments, The raw data that is extracted from the credit card statement (PDF) is also show below:
" \nApr 27\nUBER TECHNOLOGIES\nINC866-576-1039CA\n$10.93\nApr 27\nUBER TECHNOLOGIES\nINC866-576-1039CA\n$11.38\nApr 28\nCOSTCO WHSE #1018HOUSTONTX\n$105.02\n"
Upvotes: 0
Views: 143
Reputation: 249113
You should simply work with the original raw text, which as per your comment is:
Apr 27
UBER TECHNOLOGIES
INC866-576-1039CA
$10.93
Apr 27
UBER TECHNOLOGIES
INC866-576-1039CA
$11.38
Apr 28
COSTCO WHSE #1018HOUSTONTX
$105.02
It appears that the format is:
DATE
VENDOR
[EXTRA INFO, OPTIONAL]
$AMOUNT
You can read the lines in a loop, something like this:
txs = []
for line in file:
date = pd.to_datetime(line)
vendor = next(file)
amount = next(file) # might not be amount yet
while not amount.startswith('$'):
vendor += amount
amount = next(file)
txs.append((date, vendor, amount))
# now create DataFrame
pd.DataFrame(txs)
The idea here is to read the file in chunks, with one iteration of the loop per transaction.
Upvotes: 2