Using tabula-py, why do I get a list and not a DataFrame?

Question

I want to work with PDF files, especially with tables. I wrote this code:

import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
tab

But I get a list of values, like this:

[    Nombres  Edad Ciudad
0    Noelia    20   Lima
1  Michelie    45   Lima
2    Ximena    18   Lima
3    Miguel    43   Lima]

I cannot analyze it because it's not a data frame. This is just an example; the real PDF file has several pages, with tables placed between passages of text.

So, could someone please help me with this issue?

Martin Evans · Accepted Answer

tabula should return a list of pandas DataFrames, one for each table found in the PDF. You can display them (and work with them) as follows:

import tabula

# The raw string (r'...') keeps the backslashes in the Windows path literal
dfs = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")

# Display each of the dataframes
for df in dfs:
    print(df.size)  # total number of cells (rows x columns)
    print(df)
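
Once you have the list, each element can be used like any other DataFrame. The following is a minimal sketch of two common follow-ups: picking one table by index, and stacking all tables into a single DataFrame with pd.concat. To keep it runnable without a PDF, the list dfs is built by hand here as a stand-in for what read_pdf returns; the second table is invented for illustration, and the concat step assumes every table has the same columns, which may not hold for your real file.

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf(..., pages='all').
# The second table is made up purely for illustration.
dfs = [
    pd.DataFrame({
        "Nombres": ["Noelia", "Michelie", "Ximena", "Miguel"],
        "Edad": [20, 45, 18, 43],
        "Ciudad": ["Lima", "Lima", "Lima", "Lima"],
    }),
    pd.DataFrame({"Nombres": ["Ana"], "Edad": [31], "Ciudad": ["Cusco"]}),
]

# A single table is just one element of the list.
first = dfs[0]
mean_age = first["Edad"].mean()

# Assumption: all tables share the same columns, so they can be stacked.
combined = pd.concat(dfs, ignore_index=True)

print(first.shape, mean_age, combined.shape)
```

If the tables in your PDF have different layouts, it is probably safer to keep them separate and work with dfs[0], dfs[1], and so on, rather than concatenating them.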
