String Manipulation to Dataframe

Question

I have extracted the text from a Credit Card Statement, sliced for only the transactions within the entire month, and saved that within a list. The list is 404 strings long, and an excerpt of that list is shown below:

['Apr',
 '27',
 'UBER',
 'TECHNOLOGIES',
 'INC866-576-1039CA',
 '$10.93',
 'Apr',
 '27',
 'UBER',
 'TECHNOLOGIES',
 'INC866-576-1039CA',
 '$11.38',
 'Apr',
 '28',
 'COSTCO',
 'WHSE',
 '#1018HOUSTONTX',
 '$105.02']

As you can see, the list follows a standard format where one transaction is comprised of 3 pieces: the Date (2 separate strings), the Description (2-5 strings), and the Amount (1 string).

My goal is to create a Pandas Dataframe based upon this information, but I am having trouble figuring out a way to manipulate the strings to properly segment each category (Date, Description, Amount). I believe I need to use some combination of a Regex and a loop to accomplish this.

Per comments, The raw data that is extracted from the credit card statement (PDF) is also show below:

" Apr 27 UBER TECHNOLOGIES INC866-576-1039CA $10.93 Apr 27 UBER TECHNOLOGIES INC866-576-1039CA $11.38 Apr 28 COSTCO WHSE #1018HOUSTONTX $105.02 "

John Zwinck · Accepted Answer

You should simply work with the original raw text, which as per your comment is:

Apr 27
UBER TECHNOLOGIES
INC866-576-1039CA
$10.93
Apr 27
UBER TECHNOLOGIES
INC866-576-1039CA
$11.38
Apr 28
COSTCO WHSE #1018HOUSTONTX
$105.02

It appears that the format is:

DATE
VENDOR
[EXTRA INFO, OPTIONAL]
$AMOUNT

You can read the lines in a loop, something like this:

txs = []
for line in file:
    date = pd.to_datetime(line)
    vendor = next(file)
    amount = next(file) # might not be amount yet
    while not amount.startswith('$'):
        vendor += amount
        amount = next(file)
    txs.append((date, vendor, amount))

# now create DataFrame
pd.DataFrame(txs)

The idea here is to read the file in chunks, with one iteration of the loop per transaction.

String Manipulation to Dataframe

Answers (1)

Related Questions