Reputation: 1447
I am using tabula-py to extract a table from a PDF document like this:
rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True)
rows
This gives an output like so:
[ 0
0 Customer Statement\rxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 BAALE ST...
2 Period: January 1, 2020 April 12, 2020Openin...,
0
0 Customer Statement\xxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 gg ST...,
0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via GG from\r...
1 NS Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
6 Airtime Purchase MBANKING\r101CT0000000001551...
7 Airtime Purchase MBANKING\r101CT0000000001552...
8 TRANSFER BETWEEN\rCUSTOMERS\r00001520012412113... ,
What I want from this pdf starts from index 2. So I run
rows[2]
And I get a dataframe that looks like this:
Now I want everything from index 2 through the last index, so I tried
rows[2:]
But I am getting a list and not the expected dataframe.
[ 0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via gg from\r...
1 bi Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
Please, how do I solve this? I need a single dataframe covering index 2 onwards.
Upvotes: 1
Views: 317
Reputation: 1077
You are getting this behaviour because rows is a list, and slicing a list produces another list. When you access an element at a specific index, you get the object at that index; in this case, a DataFrame object.
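A quick check of the types makes the difference visible. This is only a minimal sketch: it uses small hand-made DataFrames in place of the real tabula output, the name rows mirrors the question, and the data is invented for illustration.

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf (the data is made up).
rows = [
    pd.DataFrame({0: ["header text"]}),
    pd.DataFrame({0: ["more header text"]}),
    pd.DataFrame({0: ["03Jan2020"], 1: ["52,064.00"]}),
    pd.DataFrame({0: ["10Jan2020"], 1: ["27,064.00"]}),
]

single = rows[2]   # indexing returns the element itself
tail = rows[2:]    # slicing returns a new list holding the remaining elements

print(type(single).__name__)  # DataFrame
print(type(tail).__name__)    # list
print(len(tail))              # 2
```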
The pandas library ships with a concat function that can combine multiple DataFrame objects into one -- I believe this is what you want to do -- such that you have:
import pandas as pd
df_combo = pd.concat([rows[2], rows[3], rows[4], rows[5] ...])
Even better:
df_combo = pd.concat(rows[2:])
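As a rough sketch of how this behaves, the example below concatenates hand-made frames standing in for rows[2:] (the data is invented). Passing ignore_index=True is an addition that is not part of the answer above: without it, each page's rows keep their own 0..n labels, so the combined index would contain duplicates.

```python
import pandas as pd

# Two made-up "pages" of transactions, plus two header-only frames,
# arranged like the list in the question.
page_a = pd.DataFrame({0: ["03Jan2020", "10Jan2020"], 1: ["52,064.00", "27,064.00"]})
page_b = pd.DataFrame({0: ["20Jan2020"], 1: ["17,037.75"]})
rows = [pd.DataFrame({0: ["header"]}), pd.DataFrame({0: ["header"]}), page_a, page_b]

# Stack every frame from index 2 onwards and renumber the rows 0..n-1.
df_combo = pd.concat(rows[2:], ignore_index=True)

print(df_combo.shape)        # (3, 2)
print(list(df_combo.index))  # [0, 1, 2]
```

If the original per-page row labels matter, leaving out ignore_index keeps them as they were.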
Upvotes: 1
Reputation: 169
Take a look at https://medium.com/analytics-vidhya/how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673
The best way to go about what you're trying to achieve is to read the tables with the response returned as JSON, then loop through the JSON objects to build your lists.
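The JSON route could look roughly like the sketch below. It rests on an assumption about the shape of tabula-py's JSON output (a list of tables, each with a "data" list of rows, each row a list of cell dicts with a "text" field), so it is worth checking against your own file. The sample data is hand-written and tables_to_frames is a hypothetical helper, not part of tabula-py.

```python
import pandas as pd

# Hand-written sample in the assumed JSON shape. With a real file this
# would come from something like:
#   tabula.read_pdf("bank_statement.pdf", pages="all", output_format="json")
tables = [
    {"data": [[{"text": "Customer Statement"}]]},
    {"data": [
        [{"text": "03Jan2020"}, {"text": "50,000.00"}, {"text": "52,064.00"}],
        [{"text": "10Jan2020"}, {"text": "25,000.00"}, {"text": "27,064.00"}],
    ]},
]

def tables_to_frames(tables):
    """Hypothetical helper: build one DataFrame per JSON table."""
    frames = []
    for table in tables:
        # Keep only the text of each cell, row by row.
        records = [[cell["text"] for cell in row] for row in table["data"]]
        frames.append(pd.DataFrame(records))
    return frames

frames = tables_to_frames(tables)

# Skip the header-only table and combine the rest into one frame.
transactions = pd.concat(frames[1:], ignore_index=True)
print(transactions.shape)  # (2, 3)
```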
Upvotes: 0