Maria Fernanda

Reputation: 195

Using tabula-py, why do I get a list and not a DataFrame?


I want to work with PDF files, especially with tables. I wrote this code:

import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
tab

But I get a list of values, like this:

[    Nombres  Edad Ciudad
0    Noelia    20   Lima
1  Michelie    45   Lima
2    Ximena    18   Lima
3    Miguel    43   Lima]

I cannot analyze it because it's not a DataFrame. This is just an example; the real PDF file contains tables between blocks of text and spans several pages.

So, please could someone help me with this issue?

Upvotes: 5

Views: 8171

Answers (2)

Divyansh Gemini

Reputation: 178

tabula returns a list of pandas DataFrames, one per table it finds. You can get a single DataFrame by taking an element of that list, for example the first one with [0], as in the statement below.

import tabula
import pandas

tab = pandas.DataFrame(tabula.read_pdf(r'..\PDFs\Ala.pdf', pages='all')[0])
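
To see why indexing is enough, here is a minimal sketch that does not need the PDF. The `tables` list is a hand-built stand-in (an assumption, not real tabula output) for what `read_pdf` returns, using the sample data from the question:

```python
import pandas as pd

# Hypothetical stand-in for the value returned by tabula.read_pdf(...):
# a plain Python list holding one DataFrame per table found.
tables = [
    pd.DataFrame({
        "Nombres": ["Noelia", "Michelie", "Ximena", "Miguel"],
        "Edad": [20, 45, 18, 43],
        "Ciudad": ["Lima", "Lima", "Lima", "Lima"],
    })
]

# The outer object is a list, but each element is already a DataFrame.
first = tables[0]
print(type(tables).__name__)
print(type(first).__name__)

# Once you hold the element, the usual pandas operations work.
mean_age = first["Edad"].mean()
print(mean_age)
```

Since each element is already a DataFrame, `tables[0]` on its own should be sufficient; wrapping it in `pandas.DataFrame(...)` only makes a new DataFrame from the same data.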

Upvotes: 1

Martin Evans

Reputation: 46789

tabula should return a list of pandas DataFrames, one for each table found in the PDF. You could display (and work with) them as follows:

import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf

dfs = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")

# display each of the dataframes
for df in dfs:
    print(df.size)
    print(df)
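
If the tables spread over several pages share the same columns, one possible follow-up is to stack them into a single DataFrame with `pd.concat`. This is only a sketch: the `dfs` list below is built by hand as a stand-in for the `read_pdf` result, and it assumes every table has identical column names, which may not hold for a real PDF.

```python
import pandas as pd

# Hypothetical stand-in for dfs = tabula.read_pdf(..., pages='all'):
# two tables with the same columns, as if found on different pages.
dfs = [
    pd.DataFrame({"Nombres": ["Noelia", "Michelie"],
                  "Edad": [20, 45],
                  "Ciudad": ["Lima", "Lima"]}),
    pd.DataFrame({"Nombres": ["Ximena", "Miguel"],
                  "Edad": [18, 43],
                  "Ciudad": ["Lima", "Lima"]}),
]

# Stack the tables vertically; ignore_index=True renumbers rows from 0.
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)

# The result is an ordinary DataFrame, so filtering works as usual.
adults = combined[combined["Edad"] >= 21]
print(adults)
```

If the tables have different layouts, it is probably safer to keep them separate and pick the ones you need by index.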

Upvotes: 5
