I'm trying to write a simple Python script that reads tables from a PDF and returns DataFrames, which I can then concatenate and export to an .xlsx file. However, because of the way the PDF is structured, tabula returns DataFrames with different numbers of columns, which prevents me from concatenating them properly.
Some pages have no information at all in the "Parcelas" column, and others have no information in other columns, so tabula simply doesn't create those columns. Here is the short script I wrote to check how the DataFrames were being created, followed by the output I received:
import pandas as pd
import tabula
dfs = tabula.read_pdf('C:\\Users\\Samuel\\Desktop\\importante\\FATURAS\\Maio.pdf', pages='all', pandas_options={'header': None})
# Print the first row of each extracted table to compare column counts
for df in dfs:
    print(df.head(1))
And the output:
     0                      1   2         3
0  NaN  SALDO FATURA ANTERIOR  BR  2.428,72
       0             1    2   3      4
0  13/05  HLXFORTALEZA  NaN  BR  43,00
       0                               1    2   3       4
0  18/05  MERCADO DA RACAO INDUS CAUCAIA  NaN  BR  136,00
       0                             1              2   3      4    5
0  23/04  PG *TON D&L V PARC FORTALEZA  Parcela 02/02  BR  42,50  NaN
This is only a test sample. The PDF I'm writing the script for has over two hundred pages, so adjusting the tables manually is not an option.
Is there a way to get a uniform number of columns in all DataFrames? If not, how can I handle the variability in column numbers?
I don't know what to try. As a beginner in data science and coding, I'm completely lost.
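To make the second question concrete, here is a minimal sketch of one direction that might handle the variability: instead of trusting column positions, place each cell into a fixed schema according to what it looks like. Everything here is an assumption based only on the four sample rows above: the column names in `COLUMNS`, the regular expressions in `PATTERNS`, and the helpers `classify_row` and `normalize` are all hypothetical, and the sample frames are typed in by hand rather than read from the PDF.

```python
import re

import pandas as pd

# Hypothetical target schema; these names are assumptions, not from the PDF.
COLUMNS = ["date", "description", "parcela", "country", "amount"]

# Assumed cell patterns, inferred only from the sample rows in the question.
PATTERNS = {
    "date": re.compile(r"\d{2}/\d{2}"),                # e.g. 13/05
    "parcela": re.compile(r"Parcela \d{2}/\d{2}"),     # e.g. Parcela 02/02
    "country": re.compile(r"[A-Z]{2}"),                # e.g. BR
    "amount": re.compile(r"\d{1,3}(\.\d{3})*,\d{2}"),  # e.g. 2.428,72
}


def classify_row(row):
    """Place each non-empty cell of a row into the schema by its content."""
    out = dict.fromkeys(COLUMNS)
    for cell in row:
        if pd.isna(cell):
            continue
        text = str(cell).strip()
        for name, pattern in PATTERNS.items():
            if out[name] is None and pattern.fullmatch(text):
                out[name] = text
                break
        else:
            # Unmatched text is treated as (part of) the description.
            prev = out["description"]
            out["description"] = text if prev is None else f"{prev} {text}"
    return out


def normalize(df):
    """Return a new frame with exactly the columns in COLUMNS."""
    rows = [classify_row(row) for row in df.itertuples(index=False, name=None)]
    return pd.DataFrame(rows, columns=COLUMNS)


# Hand-typed stand-ins for three of the frames tabula returned above.
nan = float("nan")
frames = [
    pd.DataFrame([[nan, "SALDO FATURA ANTERIOR", "BR", "2.428,72"]]),
    pd.DataFrame([["13/05", "HLXFORTALEZA", nan, "BR", "43,00"]]),
    pd.DataFrame([["23/04", "PG *TON D&L V PARC FORTALEZA",
                   "Parcela 02/02", "BR", "42,50", nan]]),
]

combined = pd.concat([normalize(f) for f in frames], ignore_index=True)
print(combined)
```

With the real data, `frames` would be the `dfs` list returned by `tabula.read_pdf`. The patterns would likely need adjusting: a two-letter uppercase description would be mistaken for the country code, and an amount written without a thousands separator would fall through to the description.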