Reputation: 195
I want to work with PDF files, especially with tables. I wrote this code:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
tab
But I get a list of values, like this:
[ Nombres Edad Ciudad
0 Noelia 20 Lima
1 Michelie 45 Lima
2 Ximena 18 Lima
3 Miguel 43 Lima]
I cannot analyze it because it's not a DataFrame. This is just an example; the real PDF file has tables interspersed with text and spread over several pages.
Could someone please help me with this issue?
Upvotes: 5
Views: 8171
Reputation: 178
tabula returns a list of pandas DataFrames, one per table it finds, not a single DataFrame. To work with one table, select it from the list by index. The statement below takes the first table:
import tabula
import pandas
tab = pandas.DataFrame(tabula.read_pdf(r'..\PDFs\Ala.pdf', pages='all')[0])
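Indexing with [0] raises an IndexError when no table is detected, so a small guard can help. The sketch below is only an illustration: pick_table is a hypothetical helper (not part of tabula), and the tables list is a hand-built stand-in for what tabula.read_pdf would return.

```python
import pandas as pd


def pick_table(tables, index=0):
    """Return tables[index], or an empty DataFrame if that index does not exist.

    Hypothetical helper for illustration; it is not part of tabula.
    """
    if 0 <= index < len(tables):
        return tables[index]
    return pd.DataFrame()


# Hand-built stand-in for the list returned by tabula.read_pdf(...)
tables = [
    pd.DataFrame(
        {
            "Nombres": ["Noelia", "Michelie"],
            "Edad": [20, 45],
            "Ciudad": ["Lima", "Lima"],
        }
    ),
]

first = pick_table(tables)             # the first table, as a DataFrame
missing = pick_table(tables, index=3)  # no such table, so an empty DataFrame
```

Checking missing.empty afterwards tells you whether a table was actually found at that position.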
Upvotes: 1
Reputation: 46789
tabula should return a list of pandas DataFrames, one for each table found in the PDF. You could display (and work with) them as follows:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
dfs = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")
# display each of the dataframes
for df in dfs:
    print(df.size)
    print(df)
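If the tables on the different pages share the same columns, one possible next step is to stack them into a single DataFrame with pd.concat. This is a minimal sketch under that assumption; the dfs list here is a hand-built stand-in for the list that tabula.read_pdf returns, so it runs without a PDF.

```python
import pandas as pd

# Hand-built stand-in for the list returned by tabula.read_pdf(..., pages='all').
# Assumption: every table has the same columns.
dfs = [
    pd.DataFrame(
        {
            "Nombres": ["Noelia", "Michelie"],
            "Edad": [20, 45],
            "Ciudad": ["Lima", "Lima"],
        }
    ),
    pd.DataFrame(
        {
            "Nombres": ["Ximena", "Miguel"],
            "Edad": [18, 43],
            "Ciudad": ["Lima", "Lima"],
        }
    ),
]

# Stack all tables into one DataFrame and renumber the rows from 0
combined = pd.concat(dfs, ignore_index=True)

# The result is an ordinary DataFrame, so the usual analysis tools apply
print(combined)
print(combined["Edad"].mean())
```

If the tables have different layouts, it is probably better to keep them separate and pick the ones you need by index.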
Upvotes: 5