user25341380

Reputation: 1

I'm having problems with null columns while extracting DataFrames from a PDF with tabula

I'm trying to write a simple Python script that reads tables from a PDF and returns DataFrames, so that I can concatenate them and write the result to an .xlsx file. However, because of the way the PDF is structured, tabula returns DataFrames with different numbers of columns, which prevents me from concatenating them properly.

Some pages have no information at all in the "Parcelas" column, and other pages have no information in other columns, so tabula simply doesn't create those columns. Here is the simple code I wrote to check how the DataFrames were being created, followed by the output I received:

import pandas as pd
import tabula

dfs = tabula.read_pdf('C:\\Users\\Samuel\\Desktop\\importante\\FATURAS\\Maio.pdf', pages='all', pandas_options={'header': None})

for df in dfs:
  print(df.head(1))

And the output:

     0                      1   2         3
0  NaN  SALDO FATURA ANTERIOR  BR  2.428,72
       0             1   2   3      4
0  13/05  HLXFORTALEZA NaN  BR  43,00
       0                               1    2   3       4
0  18/05  MERCADO DA RACAO INDUS CAUCAIA  NaN  BR  136,00
       0                              1              2   3      4    5
0  23/04  PG *TON D&L V PARC  FORTALEZA  Parcela 02/02  BR  42,50  NaN

This is just a trial example; the PDF I'm writing the script for has over two hundred pages, so adjusting it manually is not an option.

Is there a way to get a uniform number of columns in all DataFrames? If not, how can I handle the variability in column numbers?

I don't know what to try. As a beginner in data science and coding, I'm completely lost.
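
One possible way to handle the varying column counts, offered as a rough sketch and not a definitive fix, is to ignore the column positions tabula guessed and rebuild every row into a fixed layout. The sketch below assumes things the sample output only suggests: that the last two non-empty cells of a row are always the country code and the amount, that a date looks like dd/mm, and that an installment cell starts with "Parcela". The column names in COLUMNS, the helpers normalize_row and normalize, and the sample_dfs list (which stands in for the list returned by tabula.read_pdf) are all made up for illustration.

```python
import re

import pandas as pd

# Hypothetical target layout; these names are not taken from the PDF.
COLUMNS = ["date", "description", "parcelas", "country", "amount"]
DATE_RE = re.compile(r"^\d{2}/\d{2}$")  # assumes dates look like dd/mm


def normalize_row(row):
    """Map the non-empty cells of one raw row onto the fixed layout."""
    cells = [str(c).strip() for c in row if pd.notna(c)]
    record = dict.fromkeys(COLUMNS)
    if len(cells) < 2:
        # Too little information to classify; keep whatever text exists.
        record["description"] = " ".join(cells) or None
        return record
    # Assumption: the last two non-empty cells are amount and country.
    record["amount"] = cells.pop()
    record["country"] = cells.pop()
    if cells and DATE_RE.match(cells[0]):
        record["date"] = cells.pop(0)
    if cells and cells[-1].startswith("Parcela"):
        record["parcelas"] = cells.pop()
    # Anything left over is treated as the description.
    record["description"] = " ".join(cells) or None
    return record


def normalize(df):
    """Return a copy of df with exactly the columns in COLUMNS."""
    rows = [normalize_row(r) for r in df.itertuples(index=False, name=None)]
    return pd.DataFrame(rows, columns=COLUMNS)


# Stand-in for the list returned by tabula.read_pdf(...), built from
# the rows printed in the question.
nan = float("nan")
sample_dfs = [
    pd.DataFrame([[nan, "SALDO FATURA ANTERIOR", "BR", "2.428,72"]]),
    pd.DataFrame([["13/05", "HLXFORTALEZA", nan, "BR", "43,00"]]),
    pd.DataFrame([["18/05", "MERCADO DA RACAO INDUS CAUCAIA", nan, "BR", "136,00"]]),
    pd.DataFrame([["23/04", "PG *TON D&L V PARC  FORTALEZA", "Parcela 02/02", "BR", "42,50", nan]]),
]

# Every normalized frame has the same five columns, so concat lines up.
combined = pd.concat([normalize(df) for df in sample_dfs], ignore_index=True)
print(combined)
```

With the real file, the same call would presumably be applied to the dfs list from the code above before writing the result out with DataFrame.to_excel. Rows that break the assumptions (for example a page header picked up as a table row) would need an extra filtering step.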

Upvotes: 0

Views: 40

Answers (0)
