shekwo

Reputation: 1447

Accessing indexes in a list

I am using tabula-py to extract a table from a pdf document like this:

rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True) 

rows

This gives an output like so:

[                                                   0
 0  Customer Statement\rxxxxxxx\rP...
 1  Print Date: April 12, 2020Address: 41 BAALE ST...
 2  Period: January 1, 2020 ­ April 12, 2020Openin...,
                                                    0
 0  Customer Statement\xxxxxxxx\rP...
 1  Print Date: April 12, 2020Address: 41 gg ST...,
              0          1            2          3          4          5  \
 0  03­Jan­2020          0  03­Jan­2020        NaN  50,000.00  52,064.00   
 1  10­Jan­2020          0  10­Jan­2020  25,000.00        NaN  27,064.00   
 2  10­Jan­2020          0  10­Jan­2020      25.00        NaN  27,039.00   
 3  10­Jan­2020          0  10­Jan­2020       1.25        NaN  27,037.75   
 4  20­Jan­2020  999921...  20­Jan­2020  10,000.00        NaN  17,037.75   
 5  23­Jan­2020  999984...  23­Jan­2020   4,050.00        NaN  12,987.75   
 6  23­Jan­2020          0  23­Jan­2020   1,000.00        NaN  11,987.75   
 7  24­Jan­2020          0  24­Jan­2020   2,000.00        NaN   9,987.75   
 8  24­Jan­2020          0  24­Jan­2020        NaN  30,000.00  39,987.75   

                                                    6  
 0  TRANSFER BETWEEN\rCUSTOMERS Via GG from\r...  
 1  NS Instant Payment Outward\r000013200110121...  
 2  COMMISSION\r0000132001101218050000326...\rNIP ...  
 3     VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001  
 4  CASH WITHDRAWAL FROM\rOTHER ATM ­210674­ ­4420...  
 5  POS/WEB PURCHASE\rTRANSACTION ­845061­\r­80405...  
 6  Airtime Purchase MBANKING­\r101CT0000000001551...  
 7  Airtime Purchase MBANKING­\r101CT0000000001552...  
 8  TRANSFER BETWEEN\rCUSTOMERS\r00001520012412113...  ,

What I want from this pdf starts from index 2. So I run

rows[2]

And I get a dataframe that looks like this:


Now, I want indexes from 2 till the last index. I did

rows[2:]

But I am getting a list and not the expected dataframe.

[             0          1            2          3          4          5  \
 0  03­Jan­2020          0  03­Jan­2020        NaN  50,000.00  52,064.00   
 1  10­Jan­2020          0  10­Jan­2020  25,000.00        NaN  27,064.00   
 2  10­Jan­2020          0  10­Jan­2020      25.00        NaN  27,039.00   
 3  10­Jan­2020          0  10­Jan­2020       1.25        NaN  27,037.75   
 4  20­Jan­2020  999921...  20­Jan­2020  10,000.00        NaN  17,037.75   
 5  23­Jan­2020  999984...  23­Jan­2020   4,050.00        NaN  12,987.75   
 6  23­Jan­2020          0  23­Jan­2020   1,000.00        NaN  11,987.75   
 7  24­Jan­2020          0  24­Jan­2020   2,000.00        NaN   9,987.75   
 8  24­Jan­2020          0  24­Jan­2020        NaN  30,000.00  39,987.75   

                                                    6  
 0  TRANSFER BETWEEN\rCUSTOMERS Via gg from\r...  
 1  bi Instant Payment Outward\r000013200110121...  
 2  COMMISSION\r0000132001101218050000326...\rNIP ...  
 3     VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001  
 4  CASH WITHDRAWAL FROM\rOTHER ATM ­210674­ ­4420...  
 5  POS/WEB PURCHASE\rTRANSACTION ­845061­\r­80405...

Please, how do I solve this? I need a single dataframe covering index 2 onwards.

Upvotes: 1

Views: 317

Answers (2)

You are getting this behaviour because rows is a list and slicing a list produces another list. When you access an element at a specific index, you get the object at that index; in this case, a DataFrame object.
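As a minimal sketch of that difference, the snippet below uses four tiny stand-in DataFrames in place of the real tabula output (the column name `a` and the values are made up for illustration):

```python
import pandas as pd

# Stand-in for the list that tabula.read_pdf returns: four small DataFrames.
rows = [pd.DataFrame({"a": [i]}) for i in range(4)]

single = rows[2]   # indexing returns the element itself
tail = rows[2:]    # slicing returns a new list holding the remaining elements

print(type(single).__name__)  # DataFrame
print(type(tail).__name__)    # list
print(len(tail))              # 2
```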

The pandas library ships with a concat function that can combine multiple DataFrame objects into one -- I believe this is what you want to do -- such that you have:

import pandas as pd


df_combo = pd.concat([rows[2], rows[3], rows[4], rows[5]])  # ...and so on for any further tables

Even better:

df_combo = pd.concat(rows[2:])
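One thing to watch for: each table extracted from the PDF starts its row index at 0 again, so the combined frame would carry repeated index labels. A possible way around that, sketched here with hypothetical stand-in frames (`page_a` and `page_b` are not from the question), is to pass `ignore_index=True`:

```python
import pandas as pd

# Hypothetical stand-ins for rows[2] and rows[3]; each restarts its index at 0.
page_a = pd.DataFrame({0: ["03-Jan-2020", "10-Jan-2020"], 5: ["52,064.00", "27,064.00"]})
page_b = pd.DataFrame({0: ["20-Jan-2020"], 5: ["17,037.75"]})
rows = [pd.DataFrame(), pd.DataFrame(), page_a, page_b]

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
df_combo = pd.concat(rows[2:], ignore_index=True)

print(df_combo.shape)        # (3, 2)
print(list(df_combo.index))  # [0, 1, 2]
```

Without `ignore_index=True` the index here would be 0, 1, 0, which can be confusing when you later select rows by label.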

Upvotes: 1

Peter Odetayo

Reputation: 169

Take a look at https://medium.com/analytics-vidhya/how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673

The best way to achieve what you're after is to read the tables and have the result returned as JSON, then loop through the JSON objects to build your lists.
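A rough sketch of that idea follows. It assumes the JSON you get back (for example from `tabula.read_pdf(..., output_format="json")`) is a list with one dict per table, where `"data"` holds the rows and each cell is a dict with a `"text"` key. The sample values and the helper name `table_to_frame` are made up for illustration, so check them against your own output first:

```python
import pandas as pd

# Hypothetical sample shaped like the assumed JSON layout:
# one dict per table, "data" is a list of rows, each cell has a "text" key.
tables_json = [
    {"data": [[{"text": "Customer Statement"}]]},
    {"data": [[{"text": "Print Date: April 12, 2020"}]]},
    {"data": [
        [{"text": "03-Jan-2020"}, {"text": "50,000.00"}],
        [{"text": "10-Jan-2020"}, {"text": "25,000.00"}],
    ]},
]

def table_to_frame(table):
    """Build a DataFrame from one table dict by keeping each cell's text."""
    return pd.DataFrame([[cell["text"] for cell in row] for row in table["data"]])

# Skip the first two header tables, then stack whatever remains.
frames = [table_to_frame(t) for t in tables_json[2:]]
df = pd.concat(frames, ignore_index=True)

print(df.shape)  # (2, 2)
```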

Upvotes: 0
